In the Colab Notebook below I
will demonstrate how to use unsupervised machine learning techniques
(TF-IDF, cosine similarity, Affinity Propagation, and K-means) to
classify/segment bodies of text. I had trouble converting it directly to
The Colab Notebook will demonstrate the following:
1) Download a bunch of tweets from Twitter - let's search #mining.
2) Use TF-IDF and cosine similarity to generate similarity scores of our tweets.
3) Use Affinity Propagation and K-means to classify the texts based on their similarity scores.
4) Create a word cloud for each of the classifications.
The techniques used in this post can be used to classify unseen texts into different categories and summarise them. These texts could be emails, customer complaints or other business documents.
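As a rough illustration of steps 2 and 3, here is a minimal sketch. It is not the notebook's actual code: it assumes scikit-learn is installed, the four texts are hypothetical stand-ins for downloaded tweets, and the choice of two K-means clusters is an assumption made for this toy sample.

```python
from sklearn.cluster import AffinityPropagation, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for downloaded tweets.
texts = [
    "gold mining company reports record gold production",
    "gold production rises at the mining company",
    "bitcoin mining rigs use more electricity",
    "electricity costs hit bitcoin mining rigs",
]

# Step 2: TF-IDF vectors, then pairwise cosine similarity (n_texts x n_texts).
tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)
similarity = cosine_similarity(tfidf)

# Step 3a: Affinity Propagation chooses the number of clusters itself.
# affinity="precomputed" says the input is already a similarity matrix.
ap = AffinityPropagation(affinity="precomputed", random_state=0)
ap_labels = ap.fit_predict(similarity)

# Step 3b: K-means needs the cluster count up front (2 is an assumption here).
# Each text is represented by its row of similarity scores.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
km_labels = km.fit_predict(similarity)

print("Affinity Propagation labels:", ap_labels)
print("K-means labels:", km_labels)
```

On a sample this small, Affinity Propagation may warn that it did not converge, so treat its labels as illustrative only. With real tweets you would also want to experiment with the `preference` and `damping` parameters, and with the number of K-means clusters.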
TL;DR:
- You can give a similarity score between two texts by
vectorising the features (words and their relative frequencies) in them
and calculating the cosine of the angle between these vectors.
- When you have a small number of classes/clusters/groups
of texts, you can end up with one main class that encompasses the
majority of the texts, plus other "anomaly" classes.
- Once you have texts separated into classes, a good way
to summarise and visualise a class is to use a word cloud, where you
can see the most common words.
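To make the first and last points concrete, here is a small library-free sketch. It is only an illustration under my own assumptions: the function names `tfidf_vectors`, `cosine` and `top_words` are hypothetical, the TF-IDF weighting is one simple smoothed variant (other tools may weight differently), tokenisation is a plain whitespace split, and the sample texts are made up.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one {word: weight} dict per text, using a simple TF-IDF variant."""
    docs = [Counter(t.lower().split()) for t in texts]
    n = len(docs)
    # Document frequency: in how many texts does each word appear?
    df = Counter(word for doc in docs for word in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({
            # Relative frequency times a smoothed inverse document frequency.
            w: (c / total) * (math.log((1 + n) / (1 + df[w])) + 1)
            for w, c in doc.items()
        })
    return vectors


def cosine(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_words(texts, k=3):
    """Most common words in a group of texts: the raw material for a word cloud."""
    return Counter(w for t in texts for w in t.lower().split()).most_common(k)


# Hypothetical sample texts.
texts = [
    "gold mining output rises",
    "gold mining output falls",
    "bitcoin mining uses electricity",
]
vecs = tfidf_vectors(texts)
print("similar pair:   ", cosine(vecs[0], vecs[1]))
print("dissimilar pair:", cosine(vecs[0], vecs[2]))
print("top words:", top_words(texts[:2]))
```

The first pair shares three of its four words, so it scores higher than the second pair, which shares only "mining". The word counts from `top_words` are the same information a word cloud displays, with font size standing in for frequency.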