Having successfully divided my data set up into separate years yesterday, I thought I’d go back to basics and have a look at stopwords.
In language processing, it’s apparent that there are quite a few words that add absolutely no value to a text. These are words like ‘a’, ‘all’, ‘with’ etc. NLTK (the Natural Language Toolkit, a module that can be used to process text in various ways; you can have a play with it here) has a list of 127 words that could be considered the most basic ones. Scikit-learn (which I’m using for some of the more complicated text processing algorithms) uses a list of 318…
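To make the idea concrete, here is a minimal sketch of stopword filtering in plain Python. The tiny `STOPWORDS` set and the `remove_stopwords` helper are my own illustrative assumptions, not the NLTK or scikit-learn lists; in practice you would swap in one of those (for example `nltk.corpus.stopwords.words('english')` or scikit-learn's `ENGLISH_STOP_WORDS`).

```python
import re

# Tiny hand-picked stop list, purely for illustration (an assumption,
# not the list shipped with NLTK or scikit-learn).
STOPWORDS = {"a", "all", "with", "the", "of", "and", "to", "in"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    # Keep runs of letters and apostrophes; punctuation and digits are discarded.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in stopwords]


print(remove_stopwords("All of the data was split with a script"))
# → ['data', 'was', 'split', 'script']
```

Even with a list this small, half of the words in the sample sentence disappear, which gives a feel for how much a longer list would strip out.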