Most people assume that building a machine that generates text is difficult, and with the advent of ChatGPT, that assumption has only grown stronger. Everyone is amazed at the staggering parameter counts: 7 billion, 25 billion, 175 billion. Personally, I think we give too much credit to massive language models. Don’t get me wrong, talented scientists have designed and constructed incredibly powerful language models, but you don’t need millions of dollars’ worth of time and resources to build a text generator. Honestly, pretty much every machine-learning algorithm can be used to generate text.
Over the past few months, I’ve been working on and off on a project to test this theory. Specifically, I’ve been working through some rather simple machine learning methods and figuring out ways to make them generate text. “Simple machine learning” may sound like an oxymoron, but there are very simple machine learning methods that you already know. One such example is linear regression. The main idea is very straightforward: if you have a bunch of points scattered on a graph, draw a line that represents those points. In cases like this, you don’t even need any math to find simple trends. Just take your pen and draw a straight line roughly through the points.
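To make that concrete, here’s a minimal sketch of a one-variable linear regression. The points are made up, and the use of numpy’s `polyfit` is just my illustrative choice, not anything from the project itself.

```python
# A minimal sketch of ordinary linear regression: fit a straight line
# through a handful of made-up points.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # roughly y = 2x

# Fit a degree-1 polynomial (a straight line) to the points.
slope, intercept = np.polyfit(x, y, 1)
print(f"y ≈ {slope:.2f}x + {intercept:.2f}")
```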
Linear regression is clearly simple, but using it to generate text is another story. Luckily, linear regression extends to multiple input and output variables. By turning each word into a set of 0-1 indicator variables, we can predict the next word from the current one. For example, starting with the word “once,” your model might predict the word “upon.” From there, you just make “upon” the input and generate the next word, which might be “a.” Continue this process, and you’ve generated text with just a linear regression.
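Here’s a rough sketch of that idea, assuming a tiny made-up corpus and one-hot encoded words. The real project may look different, but the fit-then-feed-back loop is the core trick.

```python
# A minimal sketch of next-word prediction with linear regression.
# The toy corpus, vocabulary, and generation length are all illustrative.
import numpy as np

corpus = "once upon a time there was a princess".split()
vocab = sorted(set(corpus))
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

# Build (current word -> next word) training pairs as one-hot vectors.
X = np.array([one_hot(w) for w in corpus[:-1]])
Y = np.array([one_hot(w) for w in corpus[1:]])

# Fit a multi-output linear regression via least squares: Y ≈ X @ W.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Generate text by repeatedly feeding the prediction back in as input.
word = "once"
output = [word]
for _ in range(6):
    scores = one_hot(word) @ W
    word = vocab[int(np.argmax(scores))]
    output.append(word)

print(" ".join(output))
```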
Linear regression wasn’t much of a problem for me since I had already built a text generator using Markov chains, so I was already familiar with transition matrices. kNNs, however, were a different story. kNN, or k-nearest neighbors, is a method that takes an input, compares it to the most similar data points, and uses their labels to inform the output. In theory, I could just take each word, look at what typically comes after it, and use that, but that doesn’t make for a particularly interesting text generator. At that point, it’s no different from a Markov chain.
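For comparison, here’s roughly what that word-level lookup amounts to: a toy Markov-chain-style transition table. The corpus and the sampling choices here are my own illustrative assumptions.

```python
# A toy illustration of why word-level "what usually comes next" lookup
# is just a Markov chain with extra steps.
from collections import Counter, defaultdict
import random

corpus = "once upon a time there was a dragon and a knight".split()

# Count, for each word, which words follow it (a transition table).
transitions = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    transitions[current][following] += 1

# Generate by sampling the next word in proportion to those counts.
word, output = "once", ["once"]
for _ in range(6):
    followers = transitions.get(word)
    if not followers:
        break
    word = random.choices(list(followers), weights=followers.values())[0]
    output.append(word)

print(" ".join(output))
```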
Since I wanted to embrace the true spirit of kNNs and prove they could generate text, I began looking for a different approach. After exploring a few mathematical options, I realized I could use character-wise prediction instead of token-wise prediction. Instead of predicting an entire word, I would predict just one letter at a time. This had the potential to fail miserably, but it would be phenomenal if it succeeded. By using character-wise prediction, I could embrace the distance metric at the heart of kNNs, but rather than using standard Euclidean (straight-line) distance, I could use Hamming distance. By re-centering a binary vector space around the input word, I could position every other word relative to it, using exclusive-or to assign a 0 or 1 to each position depending on whether its character matched my input. From there, I just needed the vectors with the smallest Hamming norm, and I was good to go.
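Here’s a hypothetical sketch of what character-wise kNN with Hamming distance could look like. The window size, the value of k, the corpus, and the majority-vote step are all assumptions on my part, not the project’s actual code.

```python
# A minimal sketch of character-wise kNN generation over a fixed-length
# context window, using Hamming distance between character strings.
from collections import Counter

corpus = "once upon a time there was a little cottage in the woods"
WINDOW, K = 4, 3

# Training data: every length-WINDOW slice of the corpus paired with the
# character that follows it.
contexts = [(corpus[i:i + WINDOW], corpus[i + WINDOW])
            for i in range(len(corpus) - WINDOW)]

def hamming(a, b):
    # Number of positions where the characters differ: a match contributes
    # 0, a mismatch contributes 1 (the exclusive-or idea from above).
    return sum(x != y for x, y in zip(a, b))

def next_char(context):
    # Rank every stored context by Hamming distance to the input and let
    # the K nearest neighbors vote on the next character.
    neighbors = sorted(contexts, key=lambda cl: hamming(cl[0], context))[:K]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Generate text one character at a time, feeding each prediction back in.
text = "once"
for _ in range(40):
    text += next_char(text[-WINDOW:])
print(text)
```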
If that all sounded like made-up math word soup, then I’ll just have to ask you to trust me. The project is still ongoing, but once it’s at a point where I feel comfortable abandoning it, I’ll try to make things clearer, probably with a few examples. For now, things seem to be working as expected. Text generation isn’t hard because putting one word after another has never been hard. The true challenge is generating meaningful text. Generating meaningful text requires understanding truth, observable reality, intent, and so much more that we can’t properly encode the way we encode language. All of the meaning that comes out of large language models is only present because we trained them on meaningful text, and we inject meaning back in when we read the output. Generating text is easy, but that doesn’t bring us any closer to generating meaning.