Categorical variables (non-numeric
variables) are often hard to use in an Artificial Neural Network (ANN) or other
data mining tools. To be clear, a categorical variable takes its values from a
finite set of labels rather than numbers, often with no natural order. For
example, linguistic terms like low and high (which are ordered but not numeric),
or countries and states, are categorical variables. Usually, data
mining models cannot work with non-numeric values directly. On the other hand,
in some cases using this type of variable is unavoidable. There are a few ways
to represent categorical variables in ANNs or other data mining models. The simplest
is to use dummy variables: a single
categorical variable with n values is converted into n new dummy variables. For
example, if you have a field "state" with 50 text labels, you create 50
new variables, each taking the value 0 or 1. If
a record has the value "AL" in the variable State, the new dummy
column representing "AL" will have the value 1, and the 49
other state dummy columns will have the value 0. The following example
illustrates this process for 3 states.
State     AL  GA  FL
Alabama   1   0   0
Georgia   0   1   0
Florida   0   0   1
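As a rough illustration of the conversion, here is a minimal Python sketch that uses only the standard library. The function name dummy_encode is made up for this example and is not taken from any particular tool; the order of the dummy columns simply follows the order in which labels first appear.

```python
def dummy_encode(values):
    """Convert a list of category labels into 0/1 dummy columns.

    Returns (categories, rows): categories lists the distinct labels in
    order of first appearance, and each row holds one 0/1 entry per category.
    """
    # dict.fromkeys keeps the first occurrence of each label, in order.
    categories = list(dict.fromkeys(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


states = ["AL", "GA", "FL"]
categories, rows = dummy_encode(states)
print(categories)
for state, row in zip(states, rows):
    print(state, row)
```

Each record ends up with exactly one 1, in the column that matches its label, and 0 everywhere else.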
The major disadvantage of this method
is that it dramatically increases the number of input variables, which can
affect the model's sensitivity. Some software handles this process automatically.
To reduce the number of dummy variables,
classes with few records can be grouped under an "Other"
category. Although this discards the information that distinguishes those
small categories from one another, it can reduce the number of inputs
substantially.
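The grouping of small classes can be sketched in a few lines of Python. The helper name group_rare, the min_count threshold, and the sample data are all assumptions made for illustration; a sensible threshold depends on the data set.

```python
from collections import Counter


def group_rare(values, min_count, other_label="Other"):
    """Replace labels seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [value if counts[value] >= min_count else other_label
            for value in values]


# Made-up sample: AL and GA are common, FL and WY each appear only once.
states = ["AL", "AL", "AL", "GA", "GA", "FL", "WY"]
grouped = group_rare(states, min_count=2)
print(grouped)
print(len(set(states)), "categories before,", len(set(grouped)), "after")
```

After grouping, FL and WY can no longer be told apart, which is the information loss mentioned above, but the variable now needs fewer dummy columns.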
References:
http://abbottanalytics.blogspot.com/2005/04/beware-of-automatic-handling-of.html
https://www.mathworks.com/support/solutions/en/data/1-8H0STM/index.html?product=SL&solution=1-8H0STM
I have also used an approach very close to this when building regression models in Minitab or Excel. But when building a regression model one should be careful: for the example that Shahab mentioned in his post, the last dummy column of each categorical variable should be deleted. That is, the binary matrix created to replace the categorical variables should have full column rank once the constant (intercept) column is included, meaning no column can be written as a linear combination of the other columns.
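To make the rank argument concrete, here is a small Python sketch. The matrix_rank helper and the four sample records are made up for illustration, and the matrix is assumed to contain an intercept column of ones, as a regression with a constant term would. With all three dummy columns present, AL + GA + FL equals the intercept column, so one column is redundant.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a small matrix, by Gaussian elimination with exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below position `rank` with a nonzero entry here.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Clear this column in every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Four made-up records: intercept column, then the AL, GA, FL dummies.
full = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
]
dropped = [row[:-1] for row in full]  # delete the last dummy column (FL)

print(matrix_rank(full), "of", len(full[0]), "columns")
print(matrix_rank(dropped), "of", len(dropped[0]), "columns")
```

The full matrix has rank 3 with 4 columns, so it is rank deficient; after dropping the FL column the rank equals the number of columns. Florida then becomes the baseline category that the other coefficients are measured against.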