The other day I learned about the mathematics behind GPT-3 and related text generation AIs. With all the math I know from cryptography and encoding systems, I have been working on an algorithm that gives you many more places to store a bit of data than the mere 8 bits of a byte, in combination with dictionaries that map these special numbers to the territory they refer to. The aim is to compress text so that the relationships between the units of the message are squashed into prime factor sets, where the presence or absence of a factor functions as a memory location.
Unfortunately, there is no simple way to access modular arithmetic on Go builtin unsigned integers; the existing implementations all relate to arbitrary length big integers, floating point numbers and 256 bit numbers.
The modulo product works: when the product is bigger than the bits can hold, it wraps around and correctly gives you the value that it should.
But reversing this operation gets confusing, because most texts and libraries assume the use of arbitrary length modular fields. So I have written some code that illustrates the vast number of interconnections between numbers within the mere 256 values of an 8 bit byte.
Unfortunately, the playground runner does not like how long that code takes to run, but it does essentially complete, and you can browse the text it creates with a text editor to really see the structure.
The first segment shows the 54 prime numbers found below the limit of 256. The interesting thing about prime factors is that only the prime itself and its powers can alter that prime's presence in a number, by multiplying or dividing, up to the point where the result falls off the end of the bits. The bits themselves are powers of 2, so every operation based on 2 and its powers can change the factors of 2 in a number, adding more or removing them using the modular inverse.
The second half of the output lists the products of multiplying every value between 0 and 255 with every other value, eliminates the repeating products each number generates over those 256 values, and shows you the complete set of values found within each row of the full square of 8 bit multiplications.
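Not the linked program itself, just a minimal sketch of the same idea in Go: list the primes below 256, then for each 8 bit value count the distinct products it generates modulo 256.

```go
package main

import "fmt"

// primesBelow returns all primes less than n using a simple sieve.
func primesBelow(n int) []int {
	sieve := make([]bool, n)
	var primes []int
	for i := 2; i < n; i++ {
		if !sieve[i] {
			primes = append(primes, i)
			for j := i * i; j < n; j += i {
				sieve[j] = true
			}
		}
	}
	return primes
}

func main() {
	primes := primesBelow(256)
	fmt.Println(len(primes), "primes below 256:")
	fmt.Println(primes)

	// For each 8 bit value, collect the distinct products it produces
	// when multiplied by every 8 bit value, reduced modulo 256.
	for a := 0; a < 256; a++ {
		seen := map[int]bool{}
		for b := 0; b < 256; b++ {
			seen[a*b%256] = true
		}
		fmt.Printf("%3d generates %3d distinct products mod 256\n", a, len(seen))
	}
}
```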
The current slight tangent in Indra development is mostly for training and education purposes, as the black art of working with finite binary integers for encoding, encryption, compression, and now, AI, has a lot of relevance to the work.
The big picture vision is the idea of building an engine a little bit like the GPT, except instead of the probabilistic encoding, which loses data, it would be lossless - you can't rebuild the input data of such a model out of the model, though the model, given the right parameters, will generate the original text.
The tantalising possibility is not just the creation of an algorithm for compressing textual data to an extreme level. There will be a dictionary, and then a small number of extra bytes related to each node, which store prime factors; by systematically adding and removing factors from these numbers, you can create several values that signify a temporal sequence of outbound paths to other symbols.
Thus, the dictionary entries become little nodes in a graph where each node has a sequential stack of references that decode to the value of the next item in the original document.
As you can see just from that little snippet of Go code over the 8 bit field, these noble prime factors, which cannot interfere with each other's presence in a number no matter what you do to anything that doesn't contain them, can be set and unset independently. And for the prime factors that have room to store exponential versions, squares, cubes, and so on, these might be able to encode ternary and higher radix digits. The most aggressive compression scheme would turn the series of primes into digit positions, with the powers of the ones that have room to exponentiate becoming higher order digits. So you would have heximal (base 6), quinary, quaternary, ternary and binary in the lower factors, acting like three state, four state, five state bits, and so on.
A prime can act as a plain binary bit when you only use one power of it and only care about one: you just make use of the first power, which is the prime itself, multiplied into the number, and then a modular inverse will unset the bit.
Out of 8 bits you have a maximum of 256 combinations. In the olden days, and still inside embedded and kernel software, packing true/false state and configuration flags into the bits of a byte has long been used as a way to save memory; the cost is an extra bitwise AND operation to mask one bit field over another to discover whether a bit is set. This is 8 times better than using the whole byte as a truth value, as in C, Go and many other languages, but it is more code and a little extra overhead for the math required to read and write this way.
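As a small refresher of that classic pattern (the flag names here are hypothetical, purely for illustration):

```go
package main

import "fmt"

// Hypothetical configuration flags packed into a single byte.
const (
	FlagVerbose  uint8 = 1 << 0 // bit 0
	FlagEncrypt  uint8 = 1 << 1 // bit 1
	FlagCompress uint8 = 1 << 2 // bit 2
)

func main() {
	var state uint8

	// Set flags with OR.
	state |= FlagVerbose
	state |= FlagCompress

	// Test flags with AND.
	fmt.Println("verbose: ", state&FlagVerbose != 0)
	fmt.Println("encrypt: ", state&FlagEncrypt != 0)
	fmt.Println("compress:", state&FlagCompress != 0)

	// Clear a flag with AND NOT.
	state &^= FlagCompress
	fmt.Println("compress after clear:", state&FlagCompress != 0)
}
```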
The teasing promise of this idea is that, as you see in that 8 bit factor list, there are 54 prime numbers, and because prime factors can be added and removed independently of each other, it seems logical to me that we have effectively created a way to address 54 positions of information, provided there is a mathematical pattern that creates a mapping between the 54 prime factors and a decoded value.
It is not magic; it is just like the bit fields, which use exclusively powers of 2 to create an ordinal series of points that act like bits, except instead of only 8 there are 54. So, in an 8 bit value, there are 54 specific intermediary values between 0 and 255 that can, in a sense, be independently set without affecting the others, acting much like the power of 2 bits in a bit field.
Prime Factors as Bits
To illustrate it as small numbers, let's consider 4 bits as a scaled down example to show the principle:
4 bits create 16 possible values by combining the powers of 2. The following numbers in the 4 bit field can be independently added and removed using binary AND and OR, with each power of 2 mapped to a position in the series:
1, 2, 4, 8
If you want to set bit 1, you OR the number with 1. So, let's start with an arbitrary state that has no 1 in it and see it go in and out:
14 OR 1 = 15
15 AND 2 = 2
15 AND 4 = 4
15 AND 8 = 8
All of these statements are true, and thus you can see that setting bit 1 using the OR operation did not add or remove any of 2, 4 or 8.
Mathematically, the AND and OR operations used for this bitfield access are modulo multiplication and division operations on the prime factor of 2.
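A quick sanity check of those statements in Go, plus the multiply-by-2 correspondence for the lowest factor:

```go
package main

import "fmt"

func main() {
	x := uint8(14) // 0b1110, bit 1 not set

	x |= 1                       // set bit 1 with OR
	fmt.Println("14 OR 1 =", x)  // 15

	// The other bits are untouched: AND with each power of 2
	// still returns that power of 2.
	fmt.Println("15 AND 2 =", x&2) // 2
	fmt.Println("15 AND 4 =", x&4) // 4
	fmt.Println("15 AND 8 =", x&8) // 8

	// Multiplying by 2 is the same as shifting the bits up one place,
	// which is adding one more factor of 2.
	fmt.Println("3 * 2 =", uint8(3)*2, "== 3 << 1 =", uint8(3)<<1)
}
```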
As you go upwards through the set of positive integers, each individual number has an implied outbound graph connecting it to other numbers. When you multiply and divide numbers, you are adding and removing prime factors according to the number you are using. Multiplying and dividing by 12 is exactly the same as multiplying and dividing by its prime factors 2, 2 and 3, in any order. There are two powers of 2 and one power of 3 in 12:
Powers of 2: 2, 4
Power of 3: 3
After the small, power-rich low prime factors, the remainder of the field contains primes that appear only to the first power (the first power being just the prime itself). For the number 12, this adds 5, 7 and 11 as single power factors.
Unfortunately, 12 is not usable as a binary field modulo, but in a hybrid field with differently sized primes in it, we could encode it as two binary digits and one ternary digit, essentially a 2 × 2 × 3 state value.
Instead, the nearest power of 2 we could use as the container for this field is 16, or 4 bits, which we have already explored a little. The 4 bits can contain one more prime factor, 13.
My intuition is that roughly half of the combinations of present factors can be used; more than this causes an error that loses a value to overflow, which is where the product carries a bit past the last place in the binary field.
So how would that look?
So, in our 4 bit binary finite field, we have 16 values, and we have the following primes:
2, 3, 5, 7, 11, 13
Multiplying each one together until we pass 16 goes like this:
2, 6
Oh — here we hit the overflow, and the next product would loop back around. So what would the values be in an infinite field (enough for perfect precision) after this, and what would they be under the modulo instead?
First, the infinite field values of adding those 6 primes into a number using multiplication:
2, 6, 30, 210, 2310, 30030
I don't think it is necessary to illustrate that you can divide any one of those factors back out without changing the fact that you can still divide out the others. But the question is: what happens if the number lives in a finite field? Can we still do this isolated modification of the value to change the composition of factors in a fixed width value?
So first, what would these values be if we confine everything to our 4 bits:
2, 6, 14, 2, 6, 14
So we can see that the factor of 5 blew the number past the field size, and only the first two factors can be operated on independently. It looks like we can only alter the 2 and 3 factors without blowing up the number, and thus our 2 and 3 factors can form 4 distinct values:
1, 2, 3, 6
If we try to add in other factors, we are going to lose one to gain another.
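A tiny sketch of this 4 bit case in Go, showing the four clean values built from the factors 2 and 3, and the overflow that happens when we try to bring in the factor of 5:

```go
package main

import "fmt"

func main() {
	const mod = 16 // 4 bit field

	// The four values we can build from the factors 2 and 3 alone:
	// 1 (no factors), 2, 3, and 2*3 = 6. All stay inside the field.
	for _, v := range []int{1, 2, 3, 2 * 3} {
		fmt.Println(v, "fits in 4 bits:", v < mod)
	}

	// Trying to add the factor of 5 overflows the field:
	// 2 * 3 * 5 = 30, which wraps to 14 and no longer factors as 2 * 3 * 5.
	product := 2 * 3 * 5
	fmt.Println("2 * 3 * 5 =", product, "-> mod 16 =", product%mod)
}
```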
So it is not necessarily going to be a magical compression, perhaps? Let's see what it looks like with more bits.
The series of primes for a field of 8 bits, like the ones demonstrated in the little program above, should follow the same basic rules. We should be able to independently add and subtract prime factors using the modulo, up to the prime series whose running product stays inside the limit, in this case 256:
2, 6, 30, 210
This is the series of values generated as you multiply a prime by the next prime until your product exceeds the limitation of the bits.
So, from that list above, I can eliminate any one of these factors without eliminating the others. Let's start from the number 210, which is the product of the primes 2, 3, 5 and 7:
210 / 2 = 105
210 / 3 = 70
210 / 5 = 42
210 / 7 = 30
Just to examine one of these, 105: does it still have the 3, 5 and 7 factors in it?
105 / 3 = 35
105 / 5 = 21
105 / 7 = 15
And repeating again on the first one, which has 5 and 7 remaining:
35 / 5 = 7
35 / 7 = 5
And thus you can see that, using this scheme, you can subdivide the 8 bit value into a set of 4 independently accessible prime factor "bits".
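The same chain carried out on an actual uint8, as a sketch (the divisions here are exact, so no modular inverse is needed yet):

```go
package main

import "fmt"

func main() {
	n := uint8(210) // 2 * 3 * 5 * 7, fits in 8 bits

	// Each prime can be divided out without disturbing the others.
	for _, p := range []uint8{2, 3, 5, 7} {
		rest := n / p
		fmt.Printf("210 / %d = %3d, still divisible by the other primes: ", p, rest)
		for _, q := range []uint8{2, 3, 5, 7} {
			if q != p {
				fmt.Printf("%d:%v ", q, rest%q == 0)
			}
		}
		fmt.Println()
	}
}
```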
All of this is related to the mathematics of coding systems, of which Hamming was the most prolific discoverer: the use of the web of interactions between numbers to unambiguously encode access to a smaller field inside a larger field, for the purpose of correcting some proportion of errors in the transmission or storage of a blob of data.
At first you might think: big deal, so you can generalise the powers of 2 to create virtual bits that you can flip arbitrarily without touching the other bits. The optimal bit capacity of the number is the number of bits, it might seem, and based on the 4 and 8 bit moduli it looks like you get less capacity by using primes larger than 2. But this is not exactly where the compression magic is going to be.
Optimal Sequence Encoding of Dictionary Elements
What this is all in aid of is not so much doing funny integer math to virtualise bits of data at a lower capacity than just sticking to powers of 2; it is part of a scheme to encode sequence information associated with the entries in a dictionary.
Imagine a simple piece of text; let's use the classic typing class/alphabet lesson sentence:
the quick brown fox jumps over the lazy dog.
What are the dictionary entries for this series? What are the unique symbols in it? There are two instances of
the
and one non-word symbol, the period "."
So, let's say we take the first occurrence of each unique series of characters separated by spaces or symbols; the tokeniser could independently count compound symbols containing both numbers and letters (letter first, as in programming symbol names), as well as isolate the numbers from the letters, breaking each into a dictionary entry with a count of its instances in our original text.
So we can number the words of this sentence like this:
1 the
2 quick
3 brown
4 fox
5 jumps
6 over
~
Ok, I used ordinals there; in practice, inside the computer, the zero value is used as the "first" and 1 is actually "second", and so on. There is a long-standing disagreement about whether arrays should use 0 or 1 as the first index. It's a silly argument, because 0 is not a counting number, it is the arithmetic identity for addition and subtraction. 0 is the first ordinal, and while ordinal and cardinal numbers are written the same way, in any given context you are using them as one or the other. Array indexes are ordinals, so they should start from "first" being zero. Adding or subtracting zero does nothing, which is exactly what makes it the add/subtract identity.
It gets confusing because multiplication's identity is 1. If you are doing multiply/divide math, zero is a little trapdoor: dividing by it is undefined, and multiplying by it erases all the factors in a number in one fell swoop. Multiplication and division can be defined as iterated addition and subtraction, and once a number has been zeroed out there is no way for your formulas to operate hermetically on its factors.
In previous examples I showed how you can "add" and "remove" prime factors from a number field via multiplication and division; this operation only works if the number contains the multiplicative identity. It is not necessarily the first bit, but if no other bit is set, the first bit has to be set. Whether the first bit is set is not by itself the test for whether a factor can be modular-inverted out, in the way the OR/AND read/write of individual bits works, because 2 is itself a prime number. But we can know the multiplicative identity is present in the number by looking at whether the 1 and 2 positions are set.
1 itself is always the multiplicative identity, and dividing by it has no effect. Multiplying by 2 will always shift the first bit to the second bit, and the order of the exponent matches the bit position (the zeroth power is the multiplicative identity, the first power is the prime number itself).
Interestingly, we can look at the count of set bits alongside which numbers have the factor 2 present. Here is our trusty 4 bit field, with the numbers having a factor of 2 marked with a *:
0000 0
0001 1
0010 1 * 2
0011 2
0100 1 * 4
0101 2
0110 2 * 6
0111 3
1000 1 * 8
1001 2
1010 2 * 10
1011 3
1100 2 * 12
1101 3
1110 3 * 14
1111 4
We can now see that most of the even numbers in this range have only 1 or 2 bits set, though 14 already needs 3. The pattern keeps spreading out as the numbers grow; let's look at a brief snippet of the even numbers from a 5 bit field:
16 10000 1
18 10010 2
20 10100 2
22 10110 3
24 11000 2
26 11010 3
28 11100 3
30 11110 4
So, above 16, we see more 3s and a 4 among the even numbers. The point of this illustration is that even though our binary encoding is based on 2, the bit count does not tell you whether a number is even; the straightforward test is to divide by 2, multiply by 2 again, and compare with the initial value, since even numbers are exactly those divisible by 2 with an exact integer result.
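A tiny sketch of those checks in Go: the bit count via the standard library's math/bits, the divide-then-multiply test described above, and the usual low-bit test.

```go
package main

import (
	"fmt"
	"math/bits"
)

func main() {
	for n := uint8(0); n < 16; n++ {
		divMul := (n/2)*2 == n // the divide-then-multiply test
		lowBit := n&1 == 0     // the usual low-bit test
		fmt.Printf("%2d %04b bits=%d even(div/mul)=%v even(low bit)=%v\n",
			n, n, bits.OnesCount8(n), divMul, lowBit)
	}
}
```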
In the output of the program I linked above, some numbers have ergodicity over the field (256 possible distinct products from the multiplication) and others cover half, a third, or various other proportions. In the second section I added a section breaker that shows ascending sequences and breaks on a descending step; this makes the whole chart resemble an inverse square asymptote, and as you increase the value you increase the number of sequences that run in reverse.
Using prime factors to encode a symbol series from a dictionary
Taking a block of data, decomposing it into repeating parts and the higher order repeating parts made out of series of lower order parts, and then filtering this very large output (this is basically how all compression algorithms work) to eliminate the lower order repetitions (pairs and such) in favour of higher order ones, until you have a list of strings of symbols that captures the total unique data in the document, gives you an optimal dictionary. You then need to specify the order in which the entries should be copied into the output document, which requires encoding the list of transitions from one symbol to the next.
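A minimal sketch of the very first step, building a word dictionary and the sequence of dictionary indexes for the example sentence (whitespace tokenisation only, no punctuation handling and no higher order passes):

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	text := "the quick brown fox jumps over the lazy dog"

	var dict []string         // dictionary entries in order of first appearance
	index := map[string]int{} // word -> dictionary index
	var sequence []int        // the document as a series of dictionary references

	for _, word := range strings.Fields(text) {
		i, ok := index[word]
		if !ok {
			i = len(dict)
			index[word] = i
			dict = append(dict, word)
		}
		sequence = append(sequence, i)
	}

	fmt.Println("dictionary:", dict)
	fmt.Println("sequence:  ", sequence)
	// "the" appears twice, so index 0 occurs twice in the sequence.
}
```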
The generation of the dictionary, and making that dictionary free of any redundant bytes, is the first step. The reason compression and AI are related here is that both need to perform this step before they do the things distinctive to their fields. Note that the great granddaddy of both is the language compiler, which uses similar techniques: its first step is tokenisation, a greedy accumulation of all the sequences of symbols that appear, representing the entire gamut of symbols in the document, after which a lexical analyser examines the short sequences of symbols to find common structures and encodes the result as what is called an Abstract Syntax Tree.
Yes, you may have heard of a Merkle Abstract Syntax Tree (MAST) before, as several prominent bitcoiners are working on this.
MAST is a technique for creating abstraction in Bitcoin scripts so that one does not need to repeat fragments of the code, but can instead construct a transaction out of a number of previously published functions chained together. The chaining part is simplified by the fact the engine is a stack machine, so whatever a fragment leaves on a stack is there for the next step to interact with.
The magic of Merkle trees is also related to this. Merkle trees work like this: a piece of data can be verified using its hash, which is effectively a 1 to 1 relation to the data, and by hashing over the top of this hash the hash of a parent or preceding element in a data collection, we know with 99.9999999% certainty that a (fundamentally linear) set of items is part of a set. (The residual uncertainty exists because there are other, usually longer, sequences that can produce the same hash; a cryptographic hash's output is specifically designed so that collisions are vanishingly unlikely.)
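For reference, a minimal sketch of how a Merkle root is built from a list of leaf hashes using Go's crypto/sha256 (pairwise hashing, duplicating the last hash on odd levels; real implementations such as Bitcoin's differ in the details):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// merkleRoot folds a list of leaf hashes pairwise until one root remains.
// Assumes at least one leaf.
func merkleRoot(leaves [][32]byte) [32]byte {
	level := leaves
	for len(level) > 1 {
		if len(level)%2 == 1 {
			level = append(level, level[len(level)-1]) // duplicate odd tail
		}
		var next [][32]byte
		for i := 0; i < len(level); i += 2 {
			var pair []byte
			pair = append(pair, level[i][:]...)
			pair = append(pair, level[i+1][:]...)
			next = append(next, sha256.Sum256(pair))
		}
		level = next
	}
	return level[0]
}

func main() {
	data := []string{"the", "quick", "brown", "fox"}
	var leaves [][32]byte
	for _, d := range data {
		leaves = append(leaves, sha256.Sum256([]byte(d)))
	}
	fmt.Printf("merkle root: %x\n", merkleRoot(leaves))
}
```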
So a Merkle tree represents the final state of a process of sequentially hashing the hashes of a string of strings of bytes. Of course this is not compression, just integrity protection and tamper resistance, but at some level you can say that this 256 bit value is actually a group of primes and their powers, and that given the right parameter you could unroll the Merkle tree into its original constituent hashes.
This parameter is the magic that has enchanted me in this exploration of prime fields. There is one, that is absolutely for certain.
As we have illustrated earlier, we know that a given modulo (bit width) has a capacity to store a finite number of prime factors within it, until an operation loses the bits to zero or overflow. So a given number of bits can hold a set of primes, and these primes can be added and removed using modulo arithmetic, and in our example for 4 bits, we have 2 primes we can add and remove easily.
So, given a dictionary size and the number of times the entries are appended to the sequence that decompresses to the original text, each step can be encoded as a differential. In a dictionary of 128 symbols, a maximum distance of 64 lets you specify a next symbol with the most naive, simple encoding, so your dictionary entries each carry a series of 6 bit numbers, and you need just one 7 bit value to specify the starting point. Each entry has the list of its next symbols in the sequential order they occur: the first time you visit a symbol you read its first path, and for decompression the primes that specify this path can then be removed.
In this case, with our 7 bit starting map and our 6 bit path specifications, the simple form adds 6 bits of data for every instance of a dictionary entry. It is easy to see how, for a super simple programming language like Bitcoin script, it would make a lot of sense to enable composition of scripts where a section already exists in the chain history and then encode the compressed versions; in our 128 entry dictionary example, each repeating section costs 6 bits.
But this is the simplest and least efficient way there is to encode a path between nodes in a graph. You can derive a general rule: given a dictionary of N entries, each vector costs about log2(N/2) bits, and so the dictionary size plus the cost of the vectors in the path determines your compression ratio. Obviously, the more repeating sections, the smaller the dictionary, the smaller the vectors, and the better the compression ratio.

Compressing the Vectors
So, ideally, we would like to eliminate as much redundant information as possible from our path vectors at each symbol, and to achieve this we are going to make use of the prime factors compression we have discussed in the foregoing.
A rough figure for a typical dictionary size for a body of text like, say, the Bible or the entire edit history of Wikipedia would be similar to the size of a medium-sized natural language dictionary, around 30-80 thousand symbols.
The number of valid statements that can be constructed from such a dictionary is far smaller than the number of permutations possible given the size of the path vector list, so the naive fixed-width encoding per symbol instance is likely wasting a lot of bits. Once you generate this array of vectors, you have another thing to make into a dictionary!
So the same process as breaking up the text into pieces is done again, looking for repeating sequences of steadily increasing order, and then a second stage of eliminating all smaller symbols that only appear in bigger symbols until we have the optimal dictionary for a given text.
This is the part of the idea that closely relates to the process of language model generation with language models like the Transformer.
There was a type of text input system devised some years ago that used dictionaries and the frequency distribution of letters to create a typing system for disabled people. Most phones have a predictive text option on the keyboard, and this largely works the same way, except it operates at the level of symbols and will generate word salad for you, because it only distributes the words correctly, not their ordering.
When you take a document and analyse the symbol sequences as we are describing here, you are implicitly capturing a sample of the information that dictates the rules of a valid expression. With enough samples of valid text, you can assign a value to each dictionary entry which, when combined with the value on another entry, produces a value signifying a valid transition.
It's a state machine
It's important to understand what is being discussed here. The highest level, most simple category this relates to is that of a state machine. A state machine is a device that takes a string of inputs and, based on its internal data and those inputs, maps them to one, and exactly one, output.
You can relax that requirement and talk about a state machine that does not precisely replicate the input, but instead decomposes a query statement, captures its constituent symbols and the higher level syntactic and grammatical structure, compresses this into some small number of bits, and then uses it to systematically search the dictionary, and voila, you have a transformer style language model.
The only difference between the transformer and what we want to achieve here is losslessness. We want an exact 1:1 transformation between the dictionary/path format and the original document, such that the former can be used to generate the latter.
Firstly, one parameter that is required is the first symbol in the document; this is the seed of the document. Each dictionary entry carries its lists of references, and each time we arrive at a node as we unwrap the next symbol reference out of the entry's key, we pick the next unused one from its list.
Dictionaries contain, first, the unique symbols, that have one or more appearance in the text. Then, using the raw sequence of symbols, we can then perform the same entropy elimination on the sequence to arrive at a dictionary of phrases and possibly sentences.
The process requires depth first walking of tree data structures; depth first because it is greedy and wants to find the biggest possible reduction in size for a given text. It will find a symbol, and if it finds that the symbol's next symbol forms a pattern, then that pattern becomes the meta-symbol we refer to. So the dictionary has 2 tables: one is the list of distinct, optimal symbols found in the document, and the second is a collection of sequences of those symbols that repeat.
The compression uses this two layer abstraction to achieve optimal elimination of redundant information, and these symbol sequence patterns not only concretely encode our document text, they also contain grammatical information. Consider a simple and common construction:
The <x> of <y>
So we have first the symbols, second the symbol sequences, and here we see there is another pot of gold of information. The search strategy for this is different to the previous two. In this case, we would need a new type of symbol entry, which I will call a template, as it is used to capture patterns of symbols in which two or more of the symbols are found in exact relative positions, like the "the" and "of" in the above example.

Grammar
So the third step in the compression process is analysing these second order symbol sequences to find sequences that share 2 or more common symbols in the exact same positions. To implement this we need to generate a new set of third order symbols that represent the placeholders for the non-matching symbols in the sequences with 2 or more matching symbols.
The symbols are thus placeholders for parts of speech.
So, let's say we have gone through the second order table and found a series of third order patterns with partial matches. We have now tagged our part of speech symbols with a new unique code that associates them with every instance of words in the empty spots in our third order sequences.
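A rough sketch of that template-finding step (the example phrases here are hypothetical): compare equal-length symbol sequences and, where two differ in exactly one position, emit a template with a placeholder slot plus the words seen in that slot.

```go
package main

import "fmt"

// findTemplates compares equal-length sequences and, where two differ in
// exactly one position, records a template with a "_" placeholder and
// collects the words that appear in that slot (duplicates possible).
func findTemplates(seqs [][]string) map[string][]string {
	templates := map[string][]string{}
	for i := 0; i < len(seqs); i++ {
		for j := i + 1; j < len(seqs); j++ {
			a, b := seqs[i], seqs[j]
			if len(a) != len(b) {
				continue
			}
			diff := -1
			for k := range a {
				if a[k] != b[k] {
					if diff != -1 {
						diff = -2 // more than one difference, not a template pair
						break
					}
					diff = k
				}
			}
			if diff < 0 {
				continue
			}
			tmpl := make([]string, len(a))
			copy(tmpl, a)
			tmpl[diff] = "_"
			key := fmt.Sprint(tmpl)
			templates[key] = append(templates[key], a[diff], b[diff])
		}
	}
	return templates
}

func main() {
	seqs := [][]string{
		{"the", "king", "of", "spain"},
		{"the", "queen", "of", "spain"},
		{"the", "king", "of", "france"},
	}
	for tmpl, words := range findTemplates(seqs) {
		fmt.Println(tmpl, "<-", words)
	}
}
```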
So, for the AI text mangler case, you can see that this is a third order of structure in the text: first the words, then the phrases, and now we are abstracting the phrases. This still has a compression use case as well.
With this newly generated mess of part of speech symbols tied to template sequences, we have created a situation where the template symbol plus the number of distinct variations of the part of speech words is finite for our document. Thus, rather than storing the concrete sequences of each path individually, we can define a series of parameters representing the valid combinations of concrete words for each part of speech.
So now, in our list of instructions for how to reconstitute our document, we can eliminate the different concrete phrases that have partial matches, and instead store the template plus the symbols that fit into its slots. This obviously saves many bits of storage, as the valid set of words at a given part of speech is a definite subset of all the words.
So far, we can see that for the majority of this, the compression algorithms and text manglers have all the same things in them. Here, now, we see that even the capturing of grammar information of the text is potentially in common.
Depth and Grammar
In high performance computing, depth tends to be something you want to avoid as much as possible, because it requires so much memory access. But in this case, we are going to save a lot of time later because we will have the ability to turn text into parameters for accessing the dictionary, so we can afford to spend more time and space here to get every bit we need at the lowest cost.
So the searches are exhaustive, and at the template step we end up with a collection of templates, each associated with a list of the words that sit in each slot. Just as the repeating sequences and the templates themselves are new symbols in our dictionary, each of these is a distinct new "symbol", so we now have special symbols referring to the lists of concrete symbols that appear in them.
These groups of words will have a partial overlap between the many found template patterns, and so we can go to a 4th order of abstraction, which essentially creates symbols that correspond precisely to parts of speech.
To do this, we take the list of template slot symbols and search for the largest common collections of symbols found in them, then progressively smaller and smaller ones, until we exhaust everything down to the minimum of 2 items in common.
The 4th order symbol values thus represent, in the strictest possible way, the grammar of the document.
Genre and Intent
Having shown how, using progressive decomposition of a body of text, we can derive symbols, concrete sequences, abstract sequences, and a grammar bucket for every symbol, there are two more things stored in a document that we can analyse further through other methods.
By genre, I mean, the general nature of a body of text. A text can be a question or an answer, for example. A text can be an exposition (like this), an elaboration starting from something simple and building a complex model. A text can be a warning, a prediction, a puzzle, misdirection.
This is where we go to an even deeper level, the 5th order, where we create new symbols to anchor the thematic relations. In the document, the words in each section are a definite set, and by comparing the sets of different sections we can trace the appearance of parts of speech and see them grouping together by their proximity. So once we have created a new symbol representing the context of a given section, tied to our various parts of speech, we can see that, for starters, all of the nouns will distribute into a map of associations where their proximity to each other indicates a thematic relation.
In this process, we are creating new categories of symbols. We can treat an individual cipher, a specific sequence of ciphers, and even a template sequence of ciphers in the same way. As the process of decomposition proceeds, more bits of information in the higher order symbol path codes are compressed into a smaller sequence.
The genre of the text is a bit more fuzzy, as it depends on the context of the scope you are considering it within. One of the patterns that will be right at the top level of the templates is sentence and paragraph structure. So, in the template section of our dictionary, we see we are defining the forms also for the sentences, and these then become new high level symbols, that are tied to the templates and the concrete sequences. Such information easily defines the difference between the various types of documents that exist.
With the scopes created by these high order structures of sentence, paragraph, chapter and book, we now have a boundary for our searches for thematic information. Each sentence, paragraph, chapter and book has some set of words of the parts of speech that create subjects and themes. These associations are only intelligible via their differences to each other, giving them names is long after the point of being able to recognise them.
Thus, at this step in the process, with all the templates defined and added to the template symbol table, along with its links to the template fields, we can now start to look at the co-occurrence of words as regards their proximity to the various templates they are found in.
Each of the section scopes is represented by a symbol, and we can now create a map of paths between these scope symbols which will encode form, genre and style information. Words that occur frequently together within a sentence scope, paragraph scope, chapter or book scope then are grouped, and then all of the groups are decomposed for eliminating redundancy and achieving an optimal set of new symbols to represent it.
It is not the compression use case in play when talking about the genre and form information of the document. This is for search and for the graph of relations created in the decomposition, and it is exactly this level of the technology that the currently popular text generators have achieved: the appearance of understanding of theme and form.
The theme, genre and form symbols are the high level structuring of the document, the most salient features when first examined. Just like the repeating patterns let you find parts of speech, the repeating patterns of association with a given set of words and the patterns of exposition of the text let you define the difference between documents by their intent and format. Essay, novel, romance, science fiction, poetry: each similarly structured section will be recognised and usable as a selection criterion when filtering the text.
These high level structures require a lot more source material to fully map.
How this data can be used
So, after all is done and we have taken our little library of texts, decomposed them into dictionaries, grammars and genres, what can we do with this data?
The AI use case uses essentially the complete set of these 5 orders of structure, and its purpose is to search for various things as well as to enable the generation of correct texts from prompts, which embody the search parameters that are derived and then used to determine the path of the response.
The compression use case does not need grouping of words by intent for smaller texts, but with enough text together, the form/genre/intent symbol tables will allow the elimination of bits of data in proportion to the number of times each pattern repeats in the text.
So actually, all of it is relevant to both use cases.
A model built from a very large amount of source text is going to compress data a lot more effectively than a smaller one with fewer variations and less repetition. Really, the ultimate would be to make a model of every text in existence that can be found at a given time; with that dictionary in your possession you could probably encode your whole text as a few symbols. At some point the number of texts and the full gamut of variations will cover such a large amount of data that even quite large texts can be turned into a short series of mostly 3rd and higher order symbols.
The Prime Factor Compression
The dictionary is essentially a type of encoding scheme, like a One Time Pad. If you have the dictionary, someone can send you a message that might only be a few short bytes, and it decodes to a whole essay of 10,000 words. Different from the encryption use case, but analogous: the compressed message form is like a public key, and the dictionary is the private key. Obviously you would still encrypt the message, but the content is found by using data that corresponds to a concrete sequence of symbols from our dictionary and its 5 layers of structure.
So, the fundamental structure of the decoding path is a sequential series of references from one dictionary entry to another. You can decode the entire document by following these references to each new symbol. Our data structure therefore requires values, associated with the symbol keys, that encode the series of symbol references representing the next symbol after a given symbol, and it must do so compactly enough that the dictionary does not end up bigger than the source text.
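A toy sketch of that decode walk, with plain integer references (not yet packed into prime factors): each dictionary entry carries the list of entries that follow it, in order of occurrence, and decoding replays those lists starting from a seed symbol.

```go
package main

import "fmt"

func main() {
	dict := []string{"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}

	// next[i] lists, in order of occurrence, the entries that follow entry i.
	next := map[int][]int{
		0: {1, 6}, // "the" -> "quick", later "lazy"
		1: {2},    // "quick" -> "brown"
		2: {3},    // "brown" -> "fox"
		3: {4},    // "fox" -> "jumps"
		4: {5},    // "jumps" -> "over"
		5: {0},    // "over" -> "the"
		6: {7},    // "lazy" -> "dog"
	}

	// Decode: start at the seed symbol and, each time we visit an entry,
	// consume the next unused reference from its list.
	used := map[int]int{}
	cur := 0 // seed symbol: "the"
	out := []string{dict[cur]}
	for used[cur] < len(next[cur]) {
		i := used[cur]
		used[cur] = i + 1
		cur = next[cur][i]
		out = append(out, dict[cur])
	}
	fmt.Println(out) // [the quick brown fox jumps over the lazy dog]
}
```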
So, going back to the prime factors discussed earlier, we want to elaborate a little more upon how we are going to compress this symbol table reference data as flat as we possibly can.
We would not be dealing with tiny numbers like the 8 bit fields discussed earlier; this would use the same principles but with much larger prime sets. The codes we are encoding are references to dictionary entries, and their size relates to the depth to which the text was analysed. But they are sorted by frequency, so the lower numbered entries are the ones most often needed, and by using a variable length encoding we can minimise the space a given vector requires.
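For the variable length part, Go's standard library already offers a suitable scheme, the unsigned varint in encoding/binary, which spends fewer bytes on small, frequent reference numbers; a quick sketch:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	// Frequency-sorted dictionary references: small numbers are common.
	refs := []uint64{0, 3, 17, 200, 70000}

	buf := make([]byte, binary.MaxVarintLen64)
	for _, r := range refs {
		n := binary.PutUvarint(buf, r)
		fmt.Printf("reference %6d encodes to %d byte(s): %x\n", r, n, buf[:n])
	}
}
```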
Next, we can take these small vectors and use their prime factors to put them into containers of sequence.
If our dictionary has, given all the foregoing descriptions of symbol tables, somewhere around 100-500 thousand distinct symbols, then we can probably store every explicit reference in around 20 bits; say 19 bits, representing a field that can reference any other position in the dictionary.
Rather than keeping a sequential list of these 19 bit values in the ordinary base 2 encoding, we can collapse these references once we have generated them.
If we have, say, the symbol for the word "the" in our dictionary, it is easy to see that it is going to have a very large number of vectors, one for each appearance in the text. These most common words are so short, and yet the map of where they appear in the document is such a long run of data.
As you would have seen in the output of the program from the beginning of this article, the second half lists all of the sequences of 8 bit values; these sequences, their positions, and the lengths of their occurrences form a basis for some of the compressibility of the data.
The sequence of references is also amenable to the same entropy elimination process as the symbols and the dictionary itself. So we can eliminate the repeating patterns in a symbol's vectors, creating new special kinds of vectors that encode repeating sequences, by expanding the bits a little beyond the base uncompressed vectors to provide space for these second and higher order symbols.
We can compare the factors in the sequence of vectors to each other also, and potentially find more compact ways to mutate one into the next. This would also, again, create more symbols that can be unpacked to represent a concrete vector series. So, yes, even the vectors, references to dictionary entries, a memory map, will also be a symbol table, with some extra bits to enable the specification of patterns with low bit wide parameterisation that is smaller than the explicit list of the references.
The end result
So, assuming we have implemented all of this, and we achieve a nice ratio of around 10% overhead above the close to optimal dictionary size, and we have created empirically verifiable maps of symbols that map 1:1 with grammar rules, we have a compression format for any type of data, really, but one most interesting when applied to text.
A language model that is sufficiently detailed to correctly parse English could also be trained with one or several computer programming languages, and you would have a software scaffolding engine that could eliminate the repetitive parts of building software: where the parts are well established, bug free and reliable, the AI can take that data and recompose it for you to fit your application into.
The lossless compression scheme outlined in this document produces a data set usable for AI text mangling, as well as a kind of codex enabling massively increased compression of text data: your messages will contain precisely enough information to specify the correct path in the dictionary, which means a message is a tiny collection of these mostly abstract symbol keys that associate to concrete numbers.
Software applications can be distributed as the compressed symbol path specification that precisely generates the complete text of the program's source code. Updates can be the difference against the original, inserting and deleting and mutating the path to the new version's text.
Computer programming languages will become subordinate to this system, and using one of these dictionary/generator maps, you would be able to train it on your desired base language implementation, and then rapidly mutate the grammar, constants, restructure the standard library, and all kinds of things like this, using similar techniques as done with industrial design, by creating a genetic algorithm that makes changes, and then applies a series of tests to evaluate and rank the result.
Operating systems themselves will be rewritten. Because of the accelerated access to information this technology is bringing, experiments that were too marginal to be economically viable would be possible and within a few years I expect to see a swiss-army-knife tool that literally hacks, finds all bugs, trojans and backdoors in a computer system, and applies all known techniques to essentially completely bespoke customise your computer and software to your exact requirements, and your refinements also available to share on the internet and accelerate the improvements.
Simple compression is just aimed at saving money on storage. This kind of high level semantic aware compression is about saving time on turning a search expression into an intelligible and relevant answer based on the knowledge in the text.
Concluding rambles
I have been rather naughty and taken quite a leisurely stroll into the zone of prime factors, modulo arithmetic, AI text manglers and compression algorithms, away from my primary mission of building Indranet. I got a little over excited about the idea of playing with these prime factors for compression, but after some little bits of work I have mostly separated the useful from the not useful on this subject, and in the future this work will see more development.
My current task on Indranet, finishing up the dynamically adjusting, secure and anonymising network packet dispatch system, led me into some difficult concurrency puzzles, and a few days ago I got its encryption key change protocol working correctly.
I was way too excited about the primes. But then the text manglers, being all the rage at the moment, made me realise that at some point in the future they are going to be replacing computer language compilers, and work like I am doing right now will become a lot easier for the parts where I am not really inventing anything new but just implementing something standard with a thin set of variations.
The reason why they will become the main programming tool is because they essentially are just like compilers. They tokenise text, generate syntax and grammar graphs, structure graphs, and then from this intermediate encoding spit out machine code that will implement it.
One of the most time consuming tasks for programmers, and how I come to get restless doing the coding at certain points, is refactoring. Bad names can make code hard to fix and extend. Badly laid out files in the repository can make it hard to find something at all.
AI's are not going to take our jobs. They are going to make it so that a lot more people can write, modify and extend software, eliminating all of the dumb stuff like bulk renames and restructures, without introducing errors or messing up your comments. The comments could even be automatically generated based on your starting points and the code around it.
This post is quite a rambling stream of consciousness, and you can see me discarding the ideas about data compression and elaborating each detail of how the same things are going on inside text manglers and where the technology is probably headed.
The thing that gets me most excited about the AI tech is the possibility of being able to completely tailor my entire computing environment to my own needs, down to the last detail, and to be able to take new ideas and add them to the pile. With Indra and the whole web5 p2p thing coming, everyone sharing and riffing and refining will further accelerate the process of eliminating error from our lives.
A lot of people worry about the technology but when guys like me get through the development process and start to release what we have built with and inspired by these new technologies everyone will see. This will be the end of software conglomerates. Microsoft, Apple. Irrelevant.