Data Econ: How to Classify and Summarise Unseen Texts Using Machine Learning

Monday, 15 July 2019

In the Colab Notebook below I will demonstrate how to use unsupervised machine learning techniques (TF-IDF, cosine similarity, Affinity Propagation, and K-means) to classify/segment bodies of text. I had trouble converting it directly to

The Colab Notebook will demonstrate the following:
1) Download a bunch of tweets from Twitter - let's search #mining.
2) Use TF-IDF and cosine similarity to generate similarity scores of our tweets.
3) Use Affinity Propagation and K-means to classify the texts based on their similarity scores.
4) Create a word cloud for each of the classifications.

The techniques used in this post can be used to classify unseen texts into different categories and summarise them. These texts could be emails, customer complaints or other business documents.
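
As a rough sketch of steps 2 and 3, here is how the pipeline could look with scikit-learn. This is not the notebook's code: the `texts` list is a made-up stand-in for downloaded tweets, and running K-means on the rows of the similarity matrix is just one way to read "based on their similarity scores".

```python
from sklearn.cluster import AffinityPropagation, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for downloaded tweets (not real data).
texts = [
    "gold mining output rises in Western Australia",
    "copper mining shares rally as prices climb",
    "new gold mine approved in Western Australia",
    "bitcoin mining rigs consume more electricity",
    "crypto mining difficulty hits record high",
    "bitcoin miners sell rigs as electricity costs rise",
]

# Step 2: turn each text into a TF-IDF vector, then compare every pair.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)
similarity = cosine_similarity(tfidf)  # square matrix, one row per text

# Step 3a: Affinity Propagation takes the similarity matrix directly
# and decides the number of clusters on its own.
ap = AffinityPropagation(affinity="precomputed", random_state=0)
ap_labels = ap.fit_predict(similarity)

# Step 3b: K-means needs the number of clusters up front. Assumption:
# each row of the similarity matrix is used as that text's feature vector.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
km_labels = km.fit_predict(similarity)

for text, ap_label, km_label in zip(texts, ap_labels, km_labels):
    print(ap_label, km_label, text)
```

The main practical difference is that K-means makes you pick `n_clusters`, while Affinity Propagation chooses the number of groups itself, so the two can disagree on the same texts.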

TL;DR:

You can compute a similarity score between two texts by vectorising their features (words and their relative frequencies) and calculating the cosine of the angle between the two vectors.
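
To make the cosine idea concrete, here is a minimal sketch using only the standard library. It uses raw word counts rather than TF-IDF weights, which is a simplification, and the function name `cosine_similarity_texts` is my own.

```python
import math
from collections import Counter


def cosine_similarity_texts(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())

    # Dot product: only words present in both texts contribute.
    shared_words = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared_words)

    # Vector lengths (Euclidean norms).
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))

    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


print(cosine_similarity_texts("gold mining news", "gold mining shares"))
print(cosine_similarity_texts("gold mining news", "weather forecast today"))
```

Texts with identical word counts score very close to 1, and texts with no words in common score 0.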
When you use a small number of classes (clusters or groups) of texts, the result can be one main class that encompasses the majority of the texts, plus a few "anomaly" classes.
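
One simple way to spot this pattern is to count how many texts landed in each class. The sketch below assumes you already have a list of cluster labels; the `labels` values here are invented for illustration.

```python
from collections import Counter

# Hypothetical cluster labels, one per text (not real results).
labels = [0, 0, 0, 1, 0, 0, 2, 0]

sizes = Counter(labels)

# The largest class is treated as the "main" one; the rest as anomalies.
main_label, main_size = sizes.most_common(1)[0]
anomaly_labels = sorted(label for label in sizes if label != main_label)

print("main class:", main_label, "with", main_size, "texts")
print("anomaly classes:", anomaly_labels)
```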
Once you have texts separated into classes, a good way to summarise and visualise a class is to use a word cloud, where you can see the most common words.
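
A word cloud is essentially a picture of word frequencies, so the underlying numbers can be sketched with the standard library alone. The helper `top_words_by_class` and the tiny `STOP_WORDS` set below are my own illustrative choices, and the sample texts are made up. Drawing the actual cloud would need a plotting package, which is not shown here.

```python
from collections import Counter, defaultdict

# A deliberately tiny stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "in", "of", "and", "to", "is", "as"}


def top_words_by_class(texts, labels, n=3):
    """Return the n most common non-stop words for each class label."""
    counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        words = [w for w in text.lower().split() if w not in STOP_WORDS]
        counts[label].update(words)
    return {label: counter.most_common(n) for label, counter in counts.items()}


# Hypothetical texts and class labels.
texts = [
    "gold mining in australia",
    "gold price and gold mining",
    "bitcoin mining rigs",
]
labels = [0, 0, 1]

for label, words in top_words_by_class(texts, labels, n=2).items():
    print(label, words)
```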
