Text classification
Text classification is a well-known topic in NLP (Natural Language Processing). Broadly, it divides into two categories:
Supervised classification
Supervised classification assigns a given text to one of a predefined set of categories. Say you have incoming emails that you want to group into two categories, Social or Promotions. In this case, you need a training set for each category and build a model from it; the classification engine can then classify any given text.
Unsupervised classification
Unsupervised classification does not need a training dataset; instead, it groups a given set of documents into logical units. This is a form of clustering, and K-Means is a popular algorithm here.
Lightweight supervised classification
We are interested in supervised classification in this article. Generally, this requires a decent amount of training data to give the right prediction. I wanted classification for one of my projects related to email handling. Most of the tools out there require a good amount of data, which I didn't have, but I did have a small amount of accurate data. So I built a lightweight classifier in Python that takes a small training dataset and produces results based on the words.
This tool relies entirely on word presence: it counts the total occurrences of the input words in the training set and returns a list of categories.
Training file format
__label__category1 training data
__label__category1 some other data
__label__category2 some data
Each line starts with `__label__` followed by the category name, then a space, then the input sentence. This format was chosen to be consistent with the fastText library, so that if you want to move to it, the training file can remain the same.
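To make the format concrete, here is a minimal sketch of how such lines could be split into a category and a sentence. The function names `parse_training_line` and `load_training` are hypothetical and are not part of the library described here; this only illustrates the `__label__` convention.

```python
def parse_training_line(line, prefix="__label__"):
    """Split one training line into (category, text); return None if malformed."""
    line = line.strip()
    if not line.startswith(prefix):
        return None
    # Everything up to the first space is the category name, the rest is the sentence.
    label, _, text = line[len(prefix):].partition(" ")
    if not label:
        return None
    return label, text.strip()


def load_training(lines):
    """Parse an iterable of lines (for example an open file) into (category, text) pairs."""
    pairs = []
    for line in lines:
        parsed = parse_training_line(line)
        if parsed is not None:
            pairs.append(parsed)
    return pairs


sample = [
    "__label__category1 training data",
    "__label__category1 some other data",
    "__label__category2 some data",
]
print(load_training(sample))
```

Lines that do not start with the prefix are skipped rather than raising an error, which is one possible choice for a small, forgiving tool.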
Invoking the classifier
results = classifier.classify("offer linkedin linkedin", "somerandomcategory")
`results` will be a list of tuples, like `[('category1', 10), ('category2', 5)]`, sorted with the top match first. 10 and 5 are the scores, i.e. the number of word matches. "somerandomcategory" is the default category that you will receive in the event of no match.
This small library should work in both Python 2.x and 3.x, and has no dependencies.
This library turns the training data into per-category word counts, then compares them with the given input text and orders the results by word matches. For the above training data, when you give the input `Hello data`, it will return `[('category1', 2), ('category2', 1)]`, since `data` occurs twice in category1 and once in category2.
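The counting approach can be sketched in a few lines. This is not the library's actual code, just a minimal illustration under some assumptions: the `train` and `classify` functions are hypothetical, words are lowercased and split on whitespace, and a no-match is reported as the default category with a score of 0.

```python
from collections import Counter, defaultdict


def train(pairs):
    """Build a per-category word counter from (category, text) pairs."""
    counts = defaultdict(Counter)
    for category, text in pairs:
        counts[category].update(text.lower().split())
    return counts


def classify(counts, text, default_category):
    """Score each category by summing the training counts of the input words."""
    words = text.lower().split()
    scores = []
    for category, word_counts in counts.items():
        score = sum(word_counts[w] for w in words)
        if score > 0:
            scores.append((category, score))
    if not scores:
        # Assumption: a no-match is reported as the default category with score 0.
        return [(default_category, 0)]
    # Highest score first.
    return sorted(scores, key=lambda item: item[1], reverse=True)


pairs = [
    ("category1", "training data"),
    ("category1", "some other data"),
    ("category2", "some data"),
]
model = train(pairs)
print(classify(model, "Hello data", "unknown"))
# prints [('category1', 2), ('category2', 1)]
```

Here `Hello` matches nothing, while `data` contributes 2 for category1 and 1 for category2, which gives the ordering described above.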
Tools for Text Classification
When you need more powerful classification, a few good options are:
https://www.oracle.com/technetwork/community/bookstore/taming-text-sample-523387.pdf has some basic details.
https://github.com/facebookresearch/fastText is written in C++ with Python bindings, and it claims to be fast. Our training file format is compatible with this tool.
https://aws.amazon.com/comprehend is an NLP service for finding insights and relationships in text. The other two options discussed here have to be set up on your own server, while this one can be accessed through an API, and you pay according to your usage.