Text classification is an important part of data management. It lets us sort text documents into categories based on the features of each category. Many algorithms can be used for text classification, but today I will explain an easy approach that lets you automatically classify documents into different categories.
We follow four basic steps for text classification.
- Tokenization
- Stop words removal
- Normalization
- Similarity score
- Tokenization:
The first step of text classification is to tokenize the available sentence, that is, to split it into individual words. We tokenize because the later steps process the sentence one word at a time.
For example, given a sentence, we can tokenize it by splitting on the spaces between words.
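A minimal sketch of whitespace tokenization in Python. The sample sentence and the function name `tokenize` are illustrative assumptions, not part of the original approach.

```python
def tokenize(text: str) -> list[str]:
    """Split text into tokens on runs of whitespace."""
    # str.split() with no arguments splits on any whitespace
    # and never produces empty strings.
    return text.split()


sample = "The Earth revolves around the Sun."  # hypothetical example sentence
tokens = tokenize(sample)
print(tokens)
```

Note that punctuation stays attached to its word at this stage ("Sun." keeps its full stop); it is stripped later, during normalization.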
- Stop words Removal:
Stop words are words that appear in almost every sentence, such as "the", "is", and "of", and so carry little information about the topic. Before classifying the text we must remove them. Lists of stop words are easy to find on the Internet.
We remove stop words by comparing each word of the text with each word in the stop word list; if a word matches a stop word, we remove it from the text.
After removing the stop words, only the words that carry meaning remain.
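The comparison described above might be sketched as follows. The stop word list here is a tiny hypothetical sample (real lists contain a few hundred entries), and the name `remove_stop_words` is an assumption. Because this step runs before normalization, the sketch lower-cases each word for the comparison so that "The" still matches "the".

```python
# Tiny illustrative stop word list; a real list is much longer.
STOP_WORDS = {"the", "a", "an", "is", "of", "in", "and", "to"}


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Keep only the tokens that do not appear in the stop word list."""
    # Lower-case only for the comparison; the kept tokens are unchanged.
    return [token for token in tokens if token.lower() not in STOP_WORDS]


tokens = ["The", "Earth", "revolves", "around", "the", "Sun."]
filtered = remove_stop_words(tokens)
print(filtered)
```

Storing the stop words in a set makes each lookup constant time, which avoids literally looping over the whole list for every word.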
- Normalization:
Normalization is the process of removing special characters and converting capital letters to lower case. We repeat the previous process, but this time we start from a list of every character that is not an English letter: we compare each character of the text with that list and remove the character when it matches.
After normalization, the text contains only lowercase words.
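A possible sketch of normalization. Instead of building an explicit list of special characters, this version keeps only the 26 lowercase English letters, which has the same effect for English text; the function name `normalize` and the sample tokens are assumptions.

```python
import string


def normalize(tokens: list[str]) -> list[str]:
    """Lower-case each token and drop every character that is not an English letter."""
    normalized = []
    for token in tokens:
        # Keep a character only if it is one of a-z after lower-casing.
        cleaned = "".join(ch for ch in token.lower() if ch in string.ascii_lowercase)
        if cleaned:  # skip tokens that consisted only of special characters
            normalized.append(cleaned)
    return normalized


normalized = normalize(["Earth", "revolves", "around", "Sun.", "--"])
print(normalized)
```

Digits are removed as well under this rule, since they are not English letters.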
- Similarity:
Now we can assign the text to a category. First, we need a list of the basic words used in each category. For example, suppose we want to classify a document as science, mathematics, or politics: we build one word list per category.
Next, we compare the processed text with each category's word list separately and compute a similarity score for each category. We assign the text to the category with the highest similarity score.
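The text above does not define the similarity score, so the sketch below assumes one simple choice: the fraction of the document's words that appear in a category's word list. The keyword lists, the function names, and the sample tokens are all hypothetical.

```python
# Hypothetical keyword lists; real ones would be much larger.
CATEGORY_KEYWORDS = {
    "science": {"earth", "sun", "energy", "planet", "atom"},
    "mathematics": {"equation", "number", "theorem", "sum", "angle"},
    "politics": {"election", "government", "vote", "party", "minister"},
}


def similarity_scores(tokens: list[str]) -> dict[str, float]:
    """Score each category by the share of tokens found in its keyword list."""
    if not tokens:
        return {category: 0.0 for category in CATEGORY_KEYWORDS}
    return {
        category: sum(1 for token in tokens if token in keywords) / len(tokens)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }


def classify(tokens: list[str]) -> str:
    """Return the category with the highest similarity score."""
    scores = similarity_scores(tokens)
    # On a tie, max() returns the first category in dictionary order.
    return max(scores, key=scores.get)


tokens = ["earth", "revolves", "around", "sun"]  # tokenized, filtered, normalized
scores = similarity_scores(tokens)
print(scores)
print(classify(tokens))
```

If no word matches any category, every score is zero and the tie rule picks the first category, so a real system would likely add an "unknown" result or a minimum score threshold.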