Wednesday, February 6, 2013

Is our life safe? An example from a web hunt for DNA sequences


We learned the market-basket model on Tuesday. The basic concept is to group items that are frequently bought together. The goal for many companies is to find what other products customers may also like to buy. The most successful example is Target, whose statistician used this concept to analyze customers’ behavior. If you don’t know what I am talking about, please check Piazza. Amazon does the same thing: when you buy something from Amazon, a recommended list appears at the bottom of the page, built from the items that other people bought. However, a hidden problem that people may overlook in their shopping behavior is their privacy and security.

Our previous experience tells us that anonymous data should be safe and private, but that is not true. A genetics researcher collected genetic data that had been posted online, containing the DNA sequences of more than 1,000 people. He randomly picked five people and was able to find their full name, sex, age, and address. In the end, he got information on 50 people, including these five volunteers and their relatives. The study was published in the journal Science. He is just a researcher with no hacking skills. Now we can start worrying about how companies such as Target, Amazon, and other businesses protect the personal information they gather about us. You can read the details at the link below. I think it is useful for us to know the risks of data mining.

Most open source and commercial distributions offer some security, and the database is also secured behind a firewall. However, most security tools don’t work well with big data. In the next post, I may discuss some of the security tools that companies have been using.

1 comment:

  1. Market-Basket Model: A-Priori Algorithm
    The A-Priori algorithm is used for frequent itemset mining. It proceeds level-wise. It limits the need for main memory; by contrast, naively counting every pair of items fails if the number of items squared exceeds main memory. For frequent pairs it is a two-pass approach:
    1) In pass 1, it reads the baskets and counts in main memory the occurrences of each item.
    2) In pass 2, it reads the baskets again and counts only those pairs whose items were both found in pass 1 to occur at least ‘s’ times.

    ‘s’ is the support threshold (a minimum frequency count), which is chosen by the user to suit the application for which the algorithm is used.
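
    The two passes described above can be sketched in a few lines of Python. This is only a minimal illustration for frequent pairs, not a reference implementation: the function name apriori_pairs, the sample baskets, and the threshold s=3 are all made up for the example.

```python
from collections import Counter
from itertools import combinations


def apriori_pairs(baskets, s):
    """Return (frequent items, frequent pairs) for support threshold s.

    Illustrative sketch only: handles itemsets of size 1 and 2.
    """
    # Pass 1: count how many baskets contain each individual item.
    item_counts = Counter()
    for basket in baskets:
        item_counts.update(set(basket))
    frequent_items = {item for item, c in item_counts.items() if c >= s}

    # Pass 2: read the baskets again, but count only pairs whose
    # two items were both found frequent in pass 1.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))
    frequent_pairs = {pair: c for pair, c in pair_counts.items() if c >= s}
    return frequent_items, frequent_pairs


# Hypothetical shopping baskets, invented for this example.
baskets = [
    ["milk", "bread", "beer"],
    ["milk", "bread"],
    ["bread", "diapers"],
    ["milk", "bread", "diapers"],
    ["beer", "diapers"],
]
frequent_items, frequent_pairs = apriori_pairs(baskets, s=3)
print(frequent_items)   # the items milk, bread and diapers (set order may vary)
print(frequent_pairs)   # {('bread', 'milk'): 3}
```

    With s=3, beer appears in only two baskets, so no pair containing beer is ever counted in pass 2. That pruning is where the memory saving comes from.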

    Some techniques that make the A-Priori algorithm more efficient in retail applications, where items are grouped and maintained in a database, are as follows:
    •Hash-based itemset counting: An itemset whose corresponding hash bucket count is below the threshold cannot be frequent.
    •Transaction reduction: A transaction that does not contain any frequent k-itemset is useless in subsequent scans.
    •Partitioning: Any itemset that is potentially frequent in a database must be frequent in at least one of the partitions of the database.
    •Sampling: Mining is done on a subset (sample) of the given data, using a lower support threshold so that fewer frequent itemsets are missed. Sampling can be used effectively to minimize the number of trials needed for further analysis or for generating conclusive results.
    •Dynamic itemset counting: With this technique, new candidate itemsets are added only when all of their subsets are estimated to be frequent.
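
    As a rough illustration of the first technique in the list, hash-based itemset counting, the sketch below hashes every pair into a small table of buckets during the first pass and later keeps a pair as a candidate only if its bucket count reaches the threshold. The helper names (bucket_of, hash_filtered_candidates), the toy hash function, the bucket count, and the sample baskets are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def bucket_of(pair, num_buckets):
    # Toy deterministic hash (sum of character codes); an assumption
    # for this sketch, chosen so results do not vary between runs.
    return sum(ord(ch) for item in pair for ch in item) % num_buckets


def hash_filtered_candidates(baskets, s, num_buckets):
    """Return candidate pairs that survive the hash-bucket filter."""
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    # Single pass: count items, and hash each pair into a bucket.
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for pair in combinations(items, 2):
            bucket_counts[bucket_of(pair, num_buckets)] += 1

    frequent_items = {i for i, c in item_counts.items() if c >= s}
    # A pair whose bucket count is below s cannot be frequent, so it
    # is dropped before any exact pair counting takes place.
    return {
        pair
        for pair in combinations(sorted(frequent_items), 2)
        if bucket_counts[bucket_of(pair, num_buckets)] >= s
    }


# Hypothetical shopping baskets, invented for this example.
baskets = [
    ["milk", "bread", "beer"],
    ["milk", "bread"],
    ["bread", "diapers"],
    ["milk", "bread", "diapers"],
    ["beer", "diapers"],
]
candidates = hash_filtered_candidates(baskets, s=3, num_buckets=7)
print(candidates)
```

    The filter can let through a pair that is not actually frequent (several pairs may share a bucket), but it never drops a truly frequent pair, so an exact count over the surviving candidates is still needed afterwards.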
    References:
    1. Lecture notes on APRIORI algorithm by Professor Anita Wasilewska.
    2. Frequent Datasets and Patterns-Stanford University Lecture notes.
