Wednesday, April 3, 2013

Easy Way to Collect Twitter Posts

Let’s say that you are working on a project and want to gather the latest Twitter posts containing a certain word or set of words. There is a simple web URL that you can use that will return this information.
Let’s say that I want to see the latest posts that contain the word “auburn”. I can paste the following URL into any web browser and Twitter will send me the last ten tweets containing the word “auburn.”

http://search.twitter.com/search.json?q=auburn&result_type=recent

What comes back is raw JSON text (a full example appears further down in this post). Parsing it and stripping out the embedded HTML can be done using RapidMiner or other software packages.

The last part of the URL (“&result_type=recent”) returns the most recent posts. If you remove it, you will get a mix of the most recent posts and some of the most popular tweets.

By default this will only give you the last 10 posts that Twitter finds containing “auburn.” If you want to increase the number of posts that are loaded, you can add a parameter to the end of the URL. To load the most recent 100 posts, just add “&rpp=100”, where rpp stands for results per page and 100 is the number of posts we wish to see.

http://search.twitter.com/search.json?q=auburn&result_type=recent&rpp=100
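
If you would rather build these URLs in a script than type them by hand, here is a minimal Python sketch. The function name build_search_url is my own invention for illustration; the endpoint and the q, result_type, and rpp parameters are the ones shown in the URLs above. The sketch only assembles the URL and does not assume the endpoint still responds.

```python
from urllib.parse import urlencode

# Endpoint copied from the URLs in this post; the sketch only builds strings.
SEARCH_ENDPOINT = "http://search.twitter.com/search.json"


def build_search_url(query, result_type="recent", rpp=None):
    """Assemble a search URL (hypothetical helper, not part of any library)."""
    params = {"q": query}
    if result_type is not None:
        # Leave this out to get the mix of recent and popular tweets.
        params["result_type"] = result_type
    if rpp is not None:
        # rpp = results per page
        params["rpp"] = rpp
    return SEARCH_ENDPOINT + "?" + urlencode(params)


print(build_search_url("auburn"))
print(build_search_url("auburn", rpp=100))
```

Passing result_type=None drops that parameter entirely, which corresponds to removing “&result_type=recent” from the URL by hand.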

If you want to search for multiple terms at a time, you can do that in two different ways. First, you can put all the terms together in quotation marks (example: search for "auburn tigers" by using q="auburn tigers"). The browser will automatically transform this into the second form: q=%22auburn%20tigers%22. The second form should be used when you are going to add further search parameters after the query text.
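
The transformation from the first form to the second is ordinary URL percent-encoding, so a script can do it for you. This is a small sketch using Python's standard urllib.parse module; encode_phrase is a name I made up for the example.

```python
from urllib.parse import quote, urlencode


def encode_phrase(phrase):
    """Wrap a phrase in quotation marks and percent-encode it (hypothetical helper)."""
    return quote('"' + phrase + '"')


encoded = encode_phrase("auburn tigers")
print(encoded)

# urlencode can do the same encoding while adding further parameters.
# quote_via=quote turns spaces into %20 rather than the default "+".
query_string = urlencode(
    {"q": '"auburn tigers"', "result_type": "recent", "rpp": 100},
    quote_via=quote,
)
print(query_string)
```

The quotation marks become %22 and the space becomes %20, which matches the second form described above.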

The data that is returned contains a lot of information that might not be useful for your project, but some of it is interesting. The following are the results for the single most recent Twitter post containing “auburn”.

{"completed_in":0.056,"max_id":319540687655825409,"max_id_str":"319540687655825409","next_page":"?page=2&max_id=319540687655825409&q=auburn&rpp=1","page":1,"query":"auburn","refresh_url":"?since_id=319540687655825409&q=auburn","results":[{"created_at":"Wed, 03 Apr 2013 20:03:31 +0000","from_user":"M0******","from_user_id":21097****,"from_user_id_str":"21097****","from_user_name":"M*******M******","geo":null,"id":3195406876********,"id_str":"3195406876*********","iso_language_code":"en","metadata":{"result_type":"recent"},"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/1790229258\/image_normal.jpg","profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/1790229258\/image_normal.jpg","source":"<a href="http:\/\/twitter.com\/download\/iphone">Twitter for iPhone<\/a>","text":"RT @AuburnUPC: We are excited to announce the Auburn Airwaves concert line-up!  We are presenting Train, Hot Chelle Rae, and Green River Ordinance."}],"results_per_page":1,"since_id":0,"since_id_str":"0"}

The first and last lines of the results identify what the search was looking for. The text in between describes the post (some of the information was redacted for the purposes of this post, for the privacy of the random user whose tweet I got for the example):

  • When the post was created: "created_at":"Wed, 03 Apr 2013 20:03:31 +0000"
  • Who the user was (partially redacted for this post): "from_user":"M0******","from_user_id":21097****,"from_user_id_str":"21097****","from_user_name":"M*******M******"
  • Where the user posted from, if they geotag their tweets. In this case she doesn't, so the field is empty: "geo":null
  • The unique ID number of the tweet (partially redacted for this post): "id":3195406876********
  • The URLs of the user's profile picture: "profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/1790229258\/image_normal.jpg","profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/1790229258\/image_normal.jpg"
  • That the post was submitted using the user's iPhone: the "source" field reads Twitter for iPhone, wrapped in an HTML link
  • Then the text of the post: "text":"RT @AuburnUPC: We are excited to announce the Auburn Airwaves concert line-up!  We are presenting Train, Hot Chelle Rae, and Green River Ordinance."
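
If you prefer to pull these fields out with a script, here is a rough Python sketch. The sample response is a trimmed stand-in for the results above: the field names come from the real output, but "example_user" is a placeholder I made up because the real values are redacted, and the tweet text is shortened. The function name summarize_tweets is also just for illustration.

```python
import json
import re
from email.utils import parsedate_to_datetime

# Trimmed stand-in for a response; "example_user" is a made-up placeholder.
SAMPLE_RESPONSE = r'''
{"query": "auburn",
 "results": [
   {"created_at": "Wed, 03 Apr 2013 20:03:31 +0000",
    "from_user": "example_user",
    "geo": null,
    "source": "<a href=\"http:\/\/twitter.com\/download\/iphone\">Twitter for iPhone<\/a>",
    "text": "RT @AuburnUPC: We are excited to announce the Auburn Airwaves concert line-up!"}
 ]}
'''


def summarize_tweets(raw_json):
    """Return one small dict per tweet found in the "results" list."""
    data = json.loads(raw_json)
    summaries = []
    for tweet in data.get("results", []):
        summaries.append({
            # The timestamp uses the same format as email date headers.
            "created_at": parsedate_to_datetime(tweet["created_at"]),
            "user": tweet.get("from_user"),
            "geo": tweet.get("geo"),
            # Strip the HTML link so only the client name remains.
            "source": re.sub(r"<[^>]+>", "", tweet.get("source", "")),
            "text": tweet.get("text"),
        })
    return summaries


tweets = summarize_tweets(SAMPLE_RESPONSE)
for tweet in tweets:
    print(tweet["created_at"].isoformat(), tweet["user"], tweet["source"])
    print(tweet["text"])
```

The regular expression is a quick way to drop simple tags like the link in the "source" field; it is not a general-purpose HTML parser.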


Final Thoughts: I don't use Twitter, but for those of you who do, please know that this data is retrievable without your permission by anybody with an internet connection. I chose to redact the private information of the Twitter user whose tweet I used as an example, but anyone with a web browser could have gotten it. We can use this for big data because Twitter makes it available so that developers can access the data free of charge. So Twitter users beware!!

There are many more search operators that can be used to record tweets and narrow the search. They can be found here: https://dev.twitter.com/docs/using-search

Sources:

This post contains information from the website http://nealcaren.web.unc.edu/pizza-twitter-and-apis/, a great resource for information on big data as it relates to sociology. It is written by a professor from UNC Chapel Hill who uses social media to conduct research, and his website has a large number of tutorials on Python and APIs.
