1) First, download your document from Project Gutenberg's website as a plain-text file (I used The Entire Works of Mark Twain) and save it on your computer.
2) Open RapidMiner and click "New Process". On the left-hand pane of your screen, there should be a tab that says "Operators". This is where you can search for all of the operators for RapidMiner and its extensions. If you search the Operators tab for "read", you should get a list of the available read operators.
There are multiple read operators depending on the type of file you have, and most of them work the same way. If you scroll down, there is a "Read Documents" operator. Select this operator and drag it into your Main Process window. When you select the Read Documents operator in the Main Process window, you should see a file chooser in the right-hand pane.
4) Now we will move on to processing the document to get a list of its distinct words and their individual counts. Search the Operators list for "Process Documents". Drag this operator into the main pane the same way you did for the "Read Documents" operator.
Double-click the Process Documents operator to get inside it. This is where we will link operators together to break the entire text document down into its individual words. The operators for this can be found in the Operators pane under the Text Processing folder. You should see several more folders such as "Tokenization", "Extraction", "Filtering", "Stemming", "Transformation", and "Utility". Each folder groups the operators for one kind of task that you can perform on your document. The first thing you will want to do is tokenize the document. Tokenization splits the text into its individual words (tokens), creating a "bag of words" for your document. This allows you to do further filtering on it. Search for the "Tokenize" operator and drag it into the "Process Documents" process.
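If it helps to see what tokenization does outside of RapidMiner, here is a minimal Python sketch. It is not RapidMiner's implementation; the splitting rule (break on any run of non-letter characters) and the function name `tokenize` are my own assumptions for illustration.

```python
import re


def tokenize(text):
    """Split text into tokens on any run of non-letter characters.

    Assumption: splitting on non-letters is just one simple rule,
    chosen here for illustration only.
    """
    # re.split can leave empty strings at the edges, so drop them.
    return [tok for tok in re.split(r"[^A-Za-z]+", text) if tok]


sample = "The report of my death was an exaggeration."
tokens = tokenize(sample)
print(tokens)
```

Note that a rule this simple also splits contractions and drops digits, which is one reason the later filtering steps matter.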
Connect the "doc" node of the process to the "doc" input node of the operator if it has not been connected automatically. Now we are ready to filter the bag of words. In the "Filtering" folder under the "Text Processing" operator folder, you can see the various filtering methods that you can apply to your process. For this example, I want to filter out words that carry little meaning on their own (such as a, and, the, as, of, etc.); therefore, I will drag the "Filter Stopwords (English)" operator into my process because my document is in English. I also want to filter out any remaining words that are shorter than three characters. Select "Filter Tokens by Length" and set your parameters as desired (in this case, I want my min number of characters to be 3, and my max number of characters to be an arbitrarily large number since I don't care about an upper bound). Connect each operator's output node to the next operator's input node, in the order above.
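The two filters and the final word count can be sketched in Python as well. This is only an illustration of the idea, not what RapidMiner does internally: the stopword list below is a tiny made-up one (the real operator uses its own built-in English list), the lowercasing step is my own assumption, and `filter_tokens` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "and", "the", "as", "of", "was", "an", "my"}


def filter_tokens(tokens, min_chars=3, max_chars=9999):
    """Drop stopwords, then drop tokens outside the length bounds."""
    kept = []
    for tok in tokens:
        word = tok.lower()  # assumption: compare and count case-insensitively
        if word in STOPWORDS:
            continue
        if not (min_chars <= len(word) <= max_chars):
            continue
        kept.append(word)
    return kept


tokens = ["The", "report", "of", "my", "death", "was", "an",
          "exaggeration", "report"]
counts = Counter(filter_tokens(tokens))
print(counts.most_common())
```

With a minimum of 3, three-letter words are kept; only tokens of one or two characters are removed by the length filter.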
This is a really good tutorial. I was able to walk through the steps very easily. I was wondering why you chose to filter out tokens shorter than three characters. Words that short are probably very common and would be eliminated by the Filter Stopwords operator anyway. Maybe as a follow-up you (or I, for that matter) could do another text processing tutorial that goes a little more in depth. I was thinking about taking a look at n-grams. An n-gram is a sequence of n consecutive words. For example, a 2-gram is a pair of adjacent words, while a 3-gram is a string of three adjacent words. I believe that this would greatly help with understanding the data that you are mining. For example, let’s say that you are mining movie reviews with your current method. Right now you might get words like “story”, “action”, or “jokes”. If you were to generate 2-grams with the new process, you might get results like “good story”, “bad action”, or “cheesy jokes”. This would give you a lot more insight into the data that you are mining. Also, it would be good to have a better way to visualize this data, for example by putting the word counts into some sort of graph or chart so that they are easier to understand.
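To make the n-gram idea concrete, here is a minimal Python sketch that builds and counts 2-grams from an already tokenized review. The function name `ngrams` and the sample tokens are made up for illustration; this is not a RapidMiner operator.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical tokens from a movie review, used only as sample input.
review = ["good", "story", "bad", "action", "good", "story"]
bigram_counts = Counter(ngrams(review, 2))
print(bigram_counts.most_common(1))
```

A list shorter than n simply yields no n-grams.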
Hi,
I went through your steps and they are good to learn from, but I'm getting some noisy data. I need all the words exactly as they appear in the document that I added to Read Document. I would also like to know how to do this same word count for multiple files in a directory.
Thanks
Sridhar
The Read Document and Process Document operators are not available in my operators list. Is there any way to import them? What is the solution?
Sir, can you please tell me how to run machine learning algorithms on text files using RapidMiner?
Could you please tell me which operator in RapidMiner can split a text into its sentences?
Thanks
Hi! I want to get specific words from a text file, like (company, increased, price), and find the relationships between them.
Can RapidMiner do that? If anyone knows, please tell me.
If you know, please share any kind of helpful material :(
I have the same question. How can I feed a list of words into my analysis? I suppose what I'm trying to do is really the opposite of Filter Stopwords (Dictionary).
Thanks a lot for the easy-to-follow tutorial.
Nice blog. Very useful tutorial. I am new to RapidMiner and this helps.
https://analyticsblog.ravivk.com