April 14th, 2015


Teaching robots how to write is almost like teaching robots to love

Blaine'll have meant it look and leaned forward to ask me to enjoy the one. Sometimes he swallows the weather, I do that their lips when Kurt pulls back of possessiveness raging bitch's legs, kissing him somewhere they usually. I want out in his tongue and when he could expect a little hair. He feels like Hunter. Blaine wasn't want to be easy to engage in high and healthy (as his back toward the doctor told you tonight, shifting down, giving him right away).

I got bored of writing porn and decided to make a computer do all of the work for me.

This project is still mostly incomplete. I have a lot more work I can do to tweak things, from improving the algorithm to improving the formatting to figuring out a mechanism to generate fake fanfic on a more ongoing basis (Twitter bot? A website?).

But anyway, I've been working on this for the last few days, and it's been a source of joy and delight and a decent amount of programming. I will put the code I used to make this up at some point, so if anyone is interested, they can take a look.

Here, I will explain the math behind it, because machine learning is a lot simpler than computer scientists would want you to believe:

Okay, so the first part of machine learning is called 'training.' This is where you show the algorithm a bunch of known and understood data so that it has some basis for its predictions afterwards. In this particular case, I fed the algorithm a bunch of existing Kurt/Blaine stories. From this large set of unstructured text, the algorithm then creates what we call 'bigrams,' which is just a fancy way of saying 'pairs.'

For example, a sentence like: "The brown dog jumped over the brown dog at the park."

Turns into this set of bigrams:

("The", "brown"), ("brown", "dog"), ("dog", "jumped"), ("jumped", "over"), ("over", "the"), ("the", "brown"), ("brown", "dog"), ("dog", "at"), ("at", "the"), ("the", "park"), ("park", ".")

Yes, punctuation does count as its own word in our world.
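As a rough sketch of this step in Python (with a toy tokenizer I made up for illustration — not the actual code from this project), the bigram extraction looks something like this:

```python
import re

def tokenize(text):
    # Split into words, with punctuation counted as its own token,
    # matching the example above.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

sentence = "The brown dog jumped over the brown dog at the park."
tokens = tokenize(sentence)

# Pair each token with the token that follows it.
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)
```

This produces the same eleven pairs listed above, with "." as its own token at the end.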

After we have bigrams, we then construct what is called a 'Conditional Frequency Distribution', which is a fancy way of saying 'we counted a bunch of things.'

So the bigrams we have can be turned into a list of the number of times we've seen word X immediately after word Y.

For example, if we take a look at the word "the", we can see that it appears as the first word of a bigram in a few different places. (For my example, capitalization doesn't matter, but the algorithm works better if it does.)

Here are the pairs that start with "the":

("The", "brown"), ("the", "brown"), ("the", "park")

So we see that "brown" shows up twice after "the" and "park" shows up once. The conditional frequency distribution then says that if we see a "the", there is a 2/3 probability that "brown" will be the next word and there is a 1/3 probability that "park" will be the next word.
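Sketched in Python using plain `collections.Counter` (rather than any particular library's CFD class — this is an illustration, not the project's actual code), the counting step looks something like this:

```python
from collections import Counter, defaultdict

# The bigrams from the example sentence, lowercased since
# capitalization doesn't matter for this example.
bigrams = [
    ("the", "brown"), ("brown", "dog"), ("dog", "jumped"),
    ("jumped", "over"), ("over", "the"), ("the", "brown"),
    ("brown", "dog"), ("dog", "at"), ("at", "the"),
    ("the", "park"), ("park", "."),
]

# For each word, count how often each candidate next word follows it.
cfd = defaultdict(Counter)
for first, second in bigrams:
    cfd[first][second] += 1

total = sum(cfd["the"].values())
print(cfd["the"]["brown"], "out of", total)  # "brown" followed "the" 2 times out of 3
```

Dividing each count by the total for its word gives exactly the 2/3 and 1/3 probabilities described above.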

We do this sort of counting for all the bigrams in our training set. Machine learning algorithms tend to get smarter the more data you throw at them, and you can see why based on the example here. Sure, in the English language, "brown" shows up plenty of times after "the", but it doesn't show up 2/3 of the time. That's just a quirk of our training set.

Now that we have our conditional frequency distribution, we can then start generating text! We need to give it a good starting word. In this particular case, I chose "Blaine". The algorithm has a list of which words have shown up after "Blaine" and will then pick one of those words based on the frequency. Say that the next word is "is". We can then pick the word after that based on the words that we've seen that have followed "is", and so on and so on.
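Putting it together, the generation loop can be sketched like this (with a tiny made-up corpus standing in for the real training data, and a `generate` function I've named myself for illustration):

```python
import random
from collections import Counter, defaultdict

# A made-up toy corpus; the real project trained on fanfic instead.
tokens = "Blaine is here . Blaine is happy . Kurt is here .".split()

# Build the conditional frequency distribution from bigrams.
cfd = defaultdict(Counter)
for first, second in zip(tokens, tokens[1:]):
    cfd[first][second] += 1

def generate(start, length=20):
    word = start
    out = [word]
    for _ in range(length):
        counts = cfd.get(word)
        if not counts:
            break  # nothing has ever followed this word, so stop
        # Pick the next word in proportion to how often it has
        # followed the current word in the training data.
        word = random.choices(list(counts), weights=list(counts.values()))[0]
        out.append(word)
    return " ".join(out)

print(generate("Blaine"))
```

With this toy corpus, every run starts "Blaine is" (since "is" is the only word that has ever followed "Blaine") and then wanders probabilistically from there.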

And that's it. That's the algorithm.

One thing I'm thinking of doing to improve my current algorithm is to use trigrams instead of bigrams. Trigrams will allow the algorithm to look at the previous two words instead of just one in order to make decisions about how to pick the next word. That should fix some of the weird jumps I'm seeing, especially with contractions.
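The trigram version is a small change to the same sketch: key the distribution on a pair of words instead of a single word (again, illustrative code, not the project's own):

```python
from collections import Counter, defaultdict

# Toy corpus for illustration only.
tokens = "he didn't know what he didn't want".split()

# Key the distribution on the previous *two* words instead of one.
cfd = defaultdict(Counter)
for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
    cfd[(a, b)][c] += 1

# Both "know" and "want" have followed the pair ("he", "didn't").
print(dict(cfd[("he", "didn't")]))
```

Generation then works the same way, except the "current state" is the last two words emitted rather than the last one.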

I also want to mess around with different corpora. Maybe different fandoms. Maybe different search criteria within each fandom. I have the tools in place, so it would be easy to do runs on different data sets.

Anyway, I'm open to suggestions, comments, thoughts! I have plans for this thing, but the future is so open and full of possibility. I'm also happy to answer questions.

This entry was originally posted at http://thedeadparrot.dreamwidth.org/581146.html. You can comment there using OpenID or you can comment here if you prefer. :)