Google’s BigQuery now lets users merge multiple large data sets
on a common key with a new feature called Big JOIN.
“Big JOIN simplifies data analysis that would otherwise
require a data transformation step, by allowing users to specify JOIN
operations using SQL,” states Michael Manoochehri, Developer Programs Engineer,
Cloud Platform.
The article uses the example of a web application that produces
millions of lines of user activity logs in a few hours. Normally, joining
all of this data against another large data set would require a cumbersome
transformation step, but Big JOIN removes it. This is a good fit for e-commerce sites.
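To make the idea of merging on a common key concrete, here is a minimal Python sketch of an inner join between two small in-memory tables. It only illustrates what a JOIN computes, not how BigQuery executes Big JOIN. The sample rows and the names `activity_log`, `customers`, and `join_on_key` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample data: user activity log lines and customer records.
activity_log = [
    {"user_id": 1, "action": "view_item"},
    {"user_id": 2, "action": "add_to_cart"},
    {"user_id": 1, "action": "checkout"},
    {"user_id": 3, "action": "view_item"},
]
customers = [
    {"user_id": 1, "country": "US"},
    {"user_id": 2, "country": "DE"},
]


def join_on_key(left, right, key):
    """Inner-join two lists of dicts on a shared key (a simple hash join)."""
    # Index the right-hand table by key so each left row is matched quickly.
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)

    joined = []
    for row in left:
        # Rows with no match on the right (user_id 3 here) are dropped.
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined


joined_rows = join_on_key(activity_log, customers, "user_id")
for row in joined_rows:
    print(row)
```

In SQL, the same result is expressed declaratively with a JOIN clause, which is the step Big JOIN makes available for large tables.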
This update also answers a popular request: a native TIMESTAMP
data type for importing date and time values while keeping time zone
information.
Much like in Google Docs, you can now share your data sets
from BigQuery’s web UI, with email notifications sent to the people you share with.
“Data analysis with the BigQuery API should be as simple as
possible, and that is in keeping with Google’s philosophy overall on many
things.”
Manoochehri notes that BigQuery’s public documentation still needs
work. However, the team has been improving it and plans to release
several tools soon to make it more ‘wieldy’.
Below is a video detailing the updates, with demos.
Article: http://googledevelopers.blogspot.com/2013/03/bigquery-gets-big-new-features-to-make.html
This doesn't require a BIG JOIN or a BIG GROUP BY aggregation, but you can run a few queries of this size on our public Wikipedia revision dataset using the BigQuery API's free quota.
/* In cases where we know the username of the editor, returns a table
* of the most prolific (by revision count) Wikipedia editors of (possibly)
* Auburn related pages, aggregated by month, ordered by rev count
*/
SELECT
title, contributor_username,
FORMAT_UTC_USEC(UTC_USEC_TO_MONTH(timestamp * 1000000)) AS revision_month,
COUNT(*) as revision_count
FROM
[publicdata:samples.wikipedia]
WHERE
contributor_username != '' AND
REGEXP_MATCH(title, "(?i)^(War Eagle|Auburn University|Auburn Tigers)")
GROUP BY
title, contributor_username, revision_month
ORDER BY
revision_count DESC;
If you have any questions about the API, ask on Stack Overflow with the tag 'google-bigquery'.
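For readers who want to see what the query above computes, here is a rough Python sketch of the same filter, month bucketing, and count on a handful of made-up rows. It is not BigQuery code. It assumes that `timestamp` holds seconds since the Unix epoch (which the query's multiplication by 1000000 suggests) and that case-insensitive title matching is intended. The sample rows and the helper `month_of` are hypothetical, and the month label format is simplified.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Hypothetical sample rows; timestamps are assumed to be seconds since the epoch.
rows = [
    {"title": "Auburn University", "contributor_username": "alice", "timestamp": 1104537600},
    {"title": "Auburn University", "contributor_username": "alice", "timestamp": 1105747200},
    {"title": "Auburn University", "contributor_username": "alice", "timestamp": 1107216000},
    {"title": "Auburn Tigers football", "contributor_username": "bob", "timestamp": 1105747200},
    {"title": "Auburn University", "contributor_username": "", "timestamp": 1105747200},
    {"title": "Alabama", "contributor_username": "carol", "timestamp": 1105747200},
]

# Same title prefixes as the query; case-insensitivity is an assumption.
title_pattern = re.compile(r"^(War Eagle|Auburn University|Auburn Tigers)", re.IGNORECASE)


def month_of(ts_seconds):
    """Bucket a Unix timestamp (seconds) into its UTC month, e.g. '2005-01'."""
    return datetime.fromtimestamp(ts_seconds, tz=timezone.utc).strftime("%Y-%m")


counts = Counter()
for row in rows:
    if row["contributor_username"] == "":
        continue  # WHERE contributor_username != ''
    if not title_pattern.match(row["title"]):
        continue  # WHERE REGEXP_MATCH(title, ...)
    # GROUP BY title, contributor_username, revision_month
    key = (row["title"], row["contributor_username"], month_of(row["timestamp"]))
    counts[key] += 1

# ORDER BY revision_count DESC
results = counts.most_common()
for (title, user, month), revision_count in results:
    print(title, user, month, revision_count)
```

The anonymous edit and the unrelated title are filtered out, and the remaining revisions are counted per title, editor, and month.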