Wednesday, March 13, 2013

How to present categorical variables in ANN


Categorical (non-numeric) variables are often hard to use in an Artificial Neural Network (ANN) or other data mining tools. To be clear, a categorical variable takes one of a finite set of values that have no numeric meaning and, in the nominal case, no natural order. Country or state names are typical examples; linguistic terms such as "low" and "high" are also categorical, although they do carry an order. Most data mining models expect numeric inputs and cannot work with text labels directly, yet in some problems these variables cannot be avoided. There are a few ways to present categorical variables to an ANN or other data mining models. The simplest is to use dummy variables: a single categorical variable with n values is converted into n new binary variables. For example, if you have a field "State" with 50 text labels, you create 50 new variables that each take the value 0 or 1. If a record has the value "AL" in the variable State, the dummy column representing "AL" has the value 1 and the other 49 state dummy columns have the value 0. The following example illustrates this process for 3 states.

State      AL   GA   FL
Alabama    1    0    0
Georgia    0    1    0
Florida    0    0    1
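
The same conversion can be sketched in a few lines of Python. This is only a minimal illustration using the standard library; the function name one_hot_encode and the sample list of states are made up for this example, and real projects would more likely rely on the encoding utilities of their modelling software.

```python
def one_hot_encode(values, categories=None):
    """Return (categories, rows); each row is a list of 0/1 dummy values."""
    if categories is None:
        # Keep first-seen order so the dummy columns are reproducible.
        categories = list(dict.fromkeys(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)  # one dummy column per category
        row[index[v]] = 1            # mark the column of this record's category
        rows.append(row)
    return categories, rows


# Hypothetical sample data: four records of the "State" field.
states = ["AL", "GA", "FL", "GA"]
columns, encoded = one_hot_encode(states)

print(columns)
for state, row in zip(states, encoded):
    print(state, row)
```

Each record ends up with exactly one 1, in the column of its own state, so a field with n distinct labels always produces n dummy columns.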

The major disadvantage of this method is that it dramatically expands the number of input variables, and this can affect the model's sensitivity. Some software handles this process automatically. To reduce the number of dummy variables, categories with few records can be grouped under a single "Other" category. Although this discards the information that distinguishes those small categories from one another, it can reduce the number of inputs substantially.
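
Grouping small categories can be sketched as follows. The threshold min_count, the function name group_rare and the sample data are assumptions chosen for illustration; a suitable cut-off depends on the data set.

```python
from collections import Counter


def group_rare(values, min_count=2, other_label="Other"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Hypothetical sample data: FL, WY and VT each appear only once.
states = ["AL", "AL", "GA", "GA", "GA", "FL", "WY", "VT"]
grouped = group_rare(states, min_count=2)

print(grouped)
print(sorted(set(grouped)))  # the categories that still need dummy columns
```

Here five distinct labels shrink to three, so three dummy columns are needed instead of five, at the cost of no longer telling FL, WY and VT apart.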

References:

http://abbottanalytics.blogspot.com/2005/04/beware-of-automatic-handling-of.html

https://www.mathworks.com/support/solutions/en/data/1-8H0STM/index.html?product=SL&solution=1-8H0STM

1 comment:

  1. I also used an approach very close to this when building regression models in Minitab or Excel. But when building a regression model, one should be careful: for the example that Shahab mentioned in his post, the last dummy column of each categorical variable should be deleted. That is, the binary matrix created to replace the categorical variables should be full rank (no column can be reproduced as a linear combination of the other columns).
