\documentclass[12pt]{article}
\usepackage{graphicx} \usepackage{amsmath} \usepackage{amssymb} \usepackage{rsfs}
\textheight 230mm \topmargin 5mm \textwidth 160mm \oddsidemargin -2mm
\pagestyle{myheadings} \markright{(Pattern Recognition)}
\newtheorem{formula}{Rule} \newtheorem{example}{Example} \newtheorem{excersize}{Exercise} \newtheorem{algorithm}{Algorithm}
\begin{document} \thispagestyle{empty}
\title{{\bf Pattern Classification} \\ {\large for \\ postgraduates \\ (9 weeks from 21 February to 17 April, 2007)}}
\author{Akira Imada \\ Brest State Technical University, Belarus \\ \hspace*{1in}\\ \hspace*{1in}\\ (last modified on)}
\maketitle
\clearpage
\begin{center}{\bf INDEX} \end{center}
\begin{itemize}
\item Introduction
  \begin{itemize}
  \item What is Classification?
    \begin{itemize}
    \item Let's start with an example --- female salmon and male salmon.
    \end{itemize}
  \end{itemize}
\item Bayesian Classification -- I
  \begin{itemize}
  \item What if we know only the prior probability of each class?
  \item How does the posterior probability improve the likelihood of the classes?
  \item Bayesian decision theory.
  \item Classification by the {\it likelihood ratio}.
  \end{itemize}
\item Bayesian Classification -- II
  \begin{itemize}
  \item What if the class-conditional distribution of the observations is a Gaussian pdf?
    \begin{itemize}
    \item One-dimensional case.
    \item Two-dimensional case.
    \item Higher-dimensional case.
    \end{itemize}
  \end{itemize}
\item Bayesian Classification -- III
  \begin{itemize}
  \item What if we know the class-conditional distribution $p(\mbox{\bf x} \vert \omega)$ but don't know some of its parameters $\boldsymbol{\theta}$?
    \begin{itemize}
    \item Maximum likelihood parameter estimation.
    \item Bayesian parameter estimation.
    \end{itemize}
  \end{itemize}
\item Bayesian Classification -- IV
  \begin{itemize}
  \item What if we don't know the class-conditional distribution $p(\mbox{\bf x} \vert \omega)$ at all?
    \begin{itemize}
    \item Parzen window approach.
    \item Probabilistic Neural Network implementation of the Parzen window approach.
    \item $k_n$-nearest neighbor approach.
    \end{itemize}
  \end{itemize}
\item Bayesian Belief Network.
\item Hidden Markov Model.
\item Fuzzy Logic Classification.
\end{itemize}
\clearpage
\section{An example to start with.}
In general, we first need to know what {\it features} can be used for the purpose of pattern classification. This is, of course, one of the big issues in pattern classification, but here we assume that we already know {\it features} which can be used efficiently to classify our target patterns.\\
\subsection{Which do you like better, female or male?}
I mean not humans, but salmon. Assume, as an example, that we want to classify salmon into females and males. The situation is that we have thousands of salmon, and our task is to classify a salmon not yet seen into female or male only by observing one feature $x$ --- the length of the salmon.
\subsection{If we only know the prior probability.}
Suppose we have no prior information except how many of the salmon seen so far were female and how many were male. We denote the class of female salmon as $\omega_1$ and the class of male salmon as $\omega_2$.
Then the above-mentioned ``how many of those seen so far were female and male'' are denoted $P(\omega_1)$ and $P(\omega_2)$, respectively, and are called {\it prior probabilities}.\\
\noindent In this case, all we can do is act according to the following rule. \bigskip
\begin{formula}[Classification only with prior probability] \label{rule-1}
If $P(\omega_1) > P(\omega_2)$ then classify it to $\omega_1$, otherwise to $\omega_2$.
\end{formula}
\noindent This might seem a trivial rule.
\subsection{If we also know the class-conditional distribution of the feature.}
If we further assume that we know the distribution of the length $x$ in each of the female and male salmon, that is, $p(x \vert \omega_1)$ and $p(x \vert \omega_2)$, then our task of classifying a salmon from an observation of $x$ becomes a little more sophisticated.\\
\noindent Note that here $P(\cdot)$ denotes a probability and $p(\cdot)$ denotes a probability density.\\
\noindent Intuitively, we might put a threshold $\theta$ as in Fig.~1. If both classes are equally important, we put it at the center between the two distributions. But sometimes we might place it differently, as in Fig.~2, if we take into account that the risk of misclassifying $x$ into $\omega_2$ while $x$ actually belongs to $\omega_1$ matters more than the opposite error\footnote{In the case of salmon, we tend to be happier if the one we bought cheaply as a male turns out to be, by mistake, a female with eggs.}.\\
\begin{figure}[bhtp] \begin{center}
\includegraphics[width=15cm,height=5cm]{two-gaussian-2.eps}
\end{center}
\caption{An example of the two distributions of the observation $x$ w.r.t. each of the classes $\omega_1$ and $\omega_2$. Here the reader might imagine the distributions of the lengths of female and male salmon.}
\label{fig:well-separate-gaussian}
\end{figure}
\begin{figure}[bhtp] \begin{center}
\includegraphics[width=15cm,height=5cm]{two-gaussian.eps}
\end{center}
\caption{Yet another example, where the two distributions are closer than in the first example. Two thresholds are shown, taking the risk of misclassification into account.}
\end{figure}
\noindent In the example of Fig.~\ref{fig:well-separate-gaussian}, you might imagine {\it a classification of humans into female and male by one feature --- the frequency of her/his voice}. In this case $p(x \vert \omega_1)$ and $p(x \vert \omega_2)$ would be well separated and easy to classify.
\section{What is Bayesian Classification?}
Assume we should guess the class that will come next with the knowledge of
\begin{itemize}
\item[(1)] the prior probabilities of class 1 ($\omega_1$) and class 2 ($\omega_2$), that is, $P(\omega_1)$ and $P(\omega_2)$;
\item[(2)] the class-conditional probability density of the observation $x$ for each of the two classes, that is, $p(x \vert \omega_1)$ and $p(x \vert \omega_2)$.
\end{itemize}
\noindent Then we can calculate $P(\omega_i \vert x)$, the probability that the class is $\omega_i$ given that $x$ is observed, as:
\begin{equation} \label{eq:001} P(\omega_i \vert x) = \frac{p(x \vert \omega_i)P(\omega_i)}{p(x)} \hspace{.1in} (i=1, 2), \end{equation}
\noindent where
\begin{equation} \label{eq:002} p(x) = p(x \vert \omega_1)P(\omega_1) + p(x \vert \omega_2)P(\omega_2) \end{equation}
which is sometimes called the {\it evidence}. Then our rule is:\\ \bigskip
\begin{formula}[Classification with posterior probability]\label{rule-2}
If $P(\omega_1 \vert x) > P(\omega_2 \vert x)$ then classify it to $\omega_1$, otherwise to $\omega_2$.
\end{formula}
\subsection{Bayesian Decision Theory}
Assume now we have multiple {\it features}:
\begin{displaymath} x_1, x_2, \cdots, x_d \end{displaymath}
multiple {\it classes}:
\begin{displaymath} \omega_1, \omega_2, \cdots, \omega_c \end{displaymath}
and also {\it actions}:
\begin{displaymath} \alpha_1, \alpha_2, \cdots, \alpha_a \end{displaymath}
\noindent to take, instead of merely guessing which class. A loss function $\lambda(\alpha_i \vert \omega_j)$ is defined as the loss incurred when we take action $\alpha_i$ while the true situation is $\omega_j$.\\
\noindent If we take action $\alpha_i$ when we observe $\mbox{\bf x}$, the conditional {\it risk} is defined as
\begin{equation} R(\alpha_i \vert \mbox{\bf x}) = \sum_{j=1}^{c}{\lambda(\alpha_i \vert \omega_j) P(\omega_j \vert \mbox{\bf x})}. \end{equation}
We want to take the action that minimizes this risk. Recall here
\begin{equation} P(\omega_j \vert \mbox{\bf x}) = \frac{p(\mbox{\bf x} \vert \omega_j)P(\omega_j)} {p(\mbox{\bf x})} \end{equation}
where
\begin{equation} p(\mbox{\bf x}) = \sum_{j=1}^{c}{p(\mbox{\bf x} \vert \omega_j)P(\omega_j)}. \end{equation}
\subsection{Two-category classification}
Let us simplify the notation as $\lambda_{ij} = \lambda(\alpha_i \vert \omega_j)$, that is,
\begin{eqnarray} \Lambda = \left( \begin{array}{cc} \lambda_{11} & \lambda_{12} \\ \lambda_{21} & \lambda_{22} \end{array} \right). \end{eqnarray}
Then
\begin{equation} \nonumber R(\alpha_1 \vert \mbox{\bf x}) = \lambda_{11}P(\omega_1 \vert \mbox{\bf x}) + \lambda_{12}P(\omega_2 \vert \mbox{\bf x}) \end{equation}
\begin{equation} R(\alpha_2 \vert \mbox{\bf x}) = \lambda_{21}P(\omega_1 \vert \mbox{\bf x}) + \lambda_{22}P(\omega_2 \vert \mbox{\bf x}) \end{equation}
Here we have:
\begin{formula}[Rule for classification]
If $R(\alpha_1 \vert \mbox{\bf x}) < R(\alpha_2 \vert \mbox{\bf x})$ then classify $\mbox{\bf x}$ to $\omega_1$, otherwise to $\omega_2$.
\end{formula}
\subsubsection{Likelihood ratio}
If we further assume that we have one feature $x$ instead of the multiple features $\mbox{\bf x}$, the {\it if}-part of the rule above becomes
\begin{equation} (\lambda_{21} - \lambda_{11})P(\omega_1 \vert x) > (\lambda_{12} - \lambda_{22})P(\omega_2 \vert x). \end{equation}
Since an error usually costs more than a correct decision, we can assume $\lambda_{21}>\lambda_{11}$ and $\lambda_{12}>\lambda_{22}$, and the rule becomes:
\begin{formula}[Rule for classification] If the following holds
\begin{equation} \frac{p(x \vert \omega_1)}{p(x \vert \omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}}\frac{P(\omega_2)}{P(\omega_1)} \end{equation}
then classify $x$ to $\omega_1$, otherwise to $\omega_2$.
\end{formula}
Note that the left-hand side is a function of $x$ called the {\it likelihood ratio}, while the right-hand side is a constant independent of $x$ which acts as a {\it threshold} to discriminate the two classes.
\begin{figure}[bhtp] \begin{center}
\includegraphics[width=14cm,height=7cm]{liklihood-ratio.eps}
\end{center} \end{figure}
\subsection{Discriminant function}
Let us define a function, not a probability nor a density but just a function, whose purpose is to partition the feature space into categories.
\section{Examples of Decision Surfaces}
\subsection{1-D Gaussian case.}
We now assume
\begin{displaymath} p(x \vert \omega_1) = \frac{1}{\sqrt{2\pi} \sigma_1} \exp\{{-\frac{1}{2} \frac{(x-\mu_1)^2}{{\sigma_1}^2}}\} \end{displaymath}
and
\begin{displaymath} p(x \vert \omega_2) = \frac{1}{\sqrt{2\pi} \sigma_2} \exp\{{-\frac{1}{2} \frac{(x-\mu_2)^2}{{\sigma_2}^2}}\}. \end{displaymath}
The one-dimensional version of our Rule~\ref{rule-2} is,
\begin{center}{\it If $P(\omega_1 \vert x) > P(\omega_2 \vert x)$ then classify $x$ to $\omega_1$, otherwise to $\omega_2$. } \end{center}
So, recalling
\begin{displaymath} P(\omega_i \vert x) = \frac{p(x \vert \omega_i) P(\omega_i)}{p(x)} \end{displaymath}
and that $p(x)$ is common to both $i=1$ and $i=2$, the {\it if}-part of the above rule is
\begin{displaymath} \frac{1}{\sqrt{2\pi} \sigma_1} \exp\{{-\frac{1}{2} \frac{(x-\mu_1)^2}{{\sigma_1}^2}}\} P(\omega_1) > \frac{1}{\sqrt{2\pi} \sigma_2} \exp\{{-\frac{1}{2} \frac{(x-\mu_2)^2}{{\sigma_2}^2}}\} P(\omega_2) \end{displaymath}
Since all the terms in this inequality are positive, the comparison is preserved if we take the natural logarithm of both sides.\footnote{That is, \begin{displaymath} A > B \Leftrightarrow \ln(A) > \ln(B). \end{displaymath}} Therefore
\begin{displaymath} \ln\frac{1}{\sqrt{2\pi}} + \ln\frac{1}{\sigma_1} - \frac{1}{2}\frac{(x-\mu_1)^2}{{\sigma_1}^2} + \ln P(\omega_1) > \ln\frac{1}{\sqrt{2\pi}} + \ln\frac{1}{\sigma_2} - \frac{1}{2}\frac{(x-\mu_2)^2}{{\sigma_2}^2} + \ln P(\omega_2) \end{displaymath}
\bigskip
\begin{excersize} Obtain the decision boundary $x_0$ when the two classes follow the Gaussian distributions $N(1, \frac{1}{2})$ and $N(3, \frac{1}{2})$, respectively. \end{excersize}
\subsection{2-D Gaussian case}
Recall the Bayesian formula Eq. (\ref{eq:001})
\begin{displaymath} P(\omega_i \vert \mbox{\bf x}) = \frac{p(\mbox{\bf x} \vert \omega_i)P(\omega_i)}{p(\mbox{\bf x})} \hspace{.1in} (i=1, 2), \end{displaymath}
\noindent where $ p(\mbox{\bf x}) = p(\mbox{\bf x} \vert \omega_1)P(\omega_1) + p(\mbox{\bf x} \vert \omega_2)P(\omega_2) $ is just the term that normalizes the posteriors $P(\omega_i \vert \mbox{\bf x})$ so that they sum to 1. Further, we assume $ P(\omega_1) = P(\omega_2) = 1/2 $ just for the sake of simplicity here.
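\noindent With equal priors, comparing the posteriors reduces to comparing the class-conditional densities. Before working out the Gaussian cases analytically, here is a minimal numerical sketch of Rule~\ref{rule-2} applied to the one-dimensional exercise above (a sketch only; it assumes Python with NumPy is available and reads $N(1,\frac{1}{2})$, $N(3,\frac{1}{2})$ as mean and variance):
\begin{verbatim}
import numpy as np

def gaussian_pdf(x, mu, var):
    """1-D Gaussian density N(mu, var)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def classify(x, prior1=0.5, prior2=0.5):
    """Rule 2: pick the class with the larger posterior.  With equal
    priors this is just a comparison of class-conditional densities."""
    p1 = gaussian_pdf(x, mu=1.0, var=0.5) * prior1
    p2 = gaussian_pdf(x, mu=3.0, var=0.5) * prior2
    return 1 if p1 > p2 else 2

# Scan a grid to locate the decision boundary numerically.
xs = np.linspace(0.0, 4.0, 401)
labels = np.array([classify(x) for x in xs])
boundary = xs[np.argmax(labels == 2)]
print("estimated decision boundary x0 ~", boundary)
\end{verbatim}
\noindent With equal variances and equal priors, the boundary found this way sits midway between the two means.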
So our decision rule to determine which class, $\omega_1$ or $\omega_2$, should be assigned to {\bf x}:
\begin{center} {\it ``If $P(\omega_1 \vert \mbox{\bf x}) > P(\omega_2 \vert \mbox{\bf x})$ then {\bf x} is $\omega_1$, otherwise $\omega_2$'' } \end{center}
depends only on $p(\mbox{\bf x} \vert \omega_i)$, which we assume to be a Gaussian p.d.f. That is,
\begin{equation} \label{eq: gaussian-multi-d} p(\mbox{\bf x} \vert \omega_i) = \frac{1}{(2\pi)^{d/2} \vert \Sigma_i \vert ^{1/2}} \exp\{- \frac{1}{2}(\mbox{\bf x} - \boldsymbol{\mu}_i)^t {\Sigma_i}^{-1} (\mbox{\bf x} - \boldsymbol{\mu}_i)\} \end{equation}
\subsubsection{What will the border look like under which conditions?}
Now that we restrict our universe to two-dimensional space, we use the notation $(x,y)$ instead of $(x_1,x_2)$, so we write $\mbox{\bf x} = (x, y)$. Furthermore, both of our two classes are assumed to follow Gaussian p.d.f.'s whose means are $\boldsymbol{\mu}_1 = (0,0)$ and $\boldsymbol{\mu}_2 = (1,0)$, and whose covariance matrices are
\begin{eqnarray} \nonumber \Sigma_1 = \left( \begin{array}{cc} a_1 & 0 \\ 0 & b_1 \end{array} \right), \hspace*{.2in}\Sigma_2 = \left( \begin{array}{cc} a_2 & 0 \\ 0 & b_2 \end{array} \right) \end{eqnarray}
\noindent Under this simple condition the determinants are $\vert \Sigma_1 \vert = a_1b_1$ and $\vert \Sigma_2 \vert = a_2b_2$, and the inverse matrices are simply
\begin{eqnarray} \nonumber \Sigma_1^{-1} = \frac{1}{a_1b_1} \left( \begin{array}{cc} b_1 & 0 \\ 0 & a_1 \end{array} \right) = \left( \begin{array}{cc} 1/a_1 & 0 \\ 0 & 1/b_1 \end{array} \right) \end{eqnarray}
and in the same way
\begin{eqnarray} \nonumber \Sigma_2^{-1} = \frac{1}{a_2b_2} \left( \begin{array}{cc} b_2 & 0 \\ 0 & a_2 \end{array} \right) = \left( \begin{array}{cc} 1/a_2 & 0 \\ 0 & 1/b_2 \end{array} \right) \end{eqnarray}
Now Eq. (\ref{eq: gaussian-multi-d}) reads, more specifically,
\begin{eqnarray} \nonumber p(\mbox{\bf x} \vert \omega_1) = \frac{1}{2\pi \sqrt{a_1b_1}} \exp\{- \frac{1}{2} (x \hspace{.1in} y) \left( \begin{array}{cc} 1/a_1 & 0 \\ 0 & 1/b_1 \end{array} \right) \left( \begin{array}{c} x \\ y \end{array} \right)\} \end{eqnarray}
and
\begin{eqnarray} \nonumber p(\mbox{\bf x} \vert \omega_2) = \frac{1}{2\pi \sqrt{a_2b_2}} \exp\{- \frac{1}{2} (x-1 \hspace{.1in} y) \left( \begin{array}{cc} 1/a_2 & 0 \\ 0 & 1/b_2 \end{array} \right) \left( \begin{array}{c} x-1 \\ y \end{array} \right)\} \end{eqnarray}
Then, taking the natural logarithm, we can define our discriminant functions $g_i(\mbox{\bf x})$, $i = 1, 2$, as
\begin{eqnarray} \nonumber g_1(\mbox{\bf x}) = - \frac{1}{2} (x \hspace{.1in} y) \left( \begin{array}{cc} 1/a_1 & 0 \\ 0 & 1/b_1 \end{array} \right) \left( \begin{array}{c} x \\ y \end{array} \right) - \ln (2\pi) - \frac{1}{2} \ln(a_1b_1) \end{eqnarray}
and
\begin{eqnarray} \nonumber g_2(\mbox{\bf x}) = - \frac{1}{2} (x-1 \hspace{.1in} y) \left( \begin{array}{cc} 1/a_2 & 0 \\ 0 & 1/b_2 \end{array} \right) \left( \begin{array}{c} x-1 \\ y \end{array} \right) - \ln (2\pi) - \frac{1}{2} \ln(a_2b_2) \end{eqnarray}
Neglecting the term $-\ln(2\pi)$ common to both equations, our new discriminant functions are
\begin{displaymath} g_1(\mbox{\bf x}) = -\frac{1}{2}\{\frac{x^2}{a_1}+\frac{y^2}{b_1}\} - \frac{1}{2} \ln(a_1b_1) \end{displaymath}
and
\begin{displaymath} g_2(\mbox{\bf x}) = -\frac{1}{2}\{\frac{(x-1)^2}{a_2}+\frac{y^2}{b_2}\} - \frac{1}{2} \ln(a_2b_2) \end{displaymath}
Finally, we obtain the border equation from $g_1(\mbox{\bf x}) - g_2(\mbox{\bf x}) = 0$:
\begin{equation}\label{eq:two-d-gaussian-final} (\frac{1}{a_1}-\frac{1}{a_2})x^2 + \frac{2}{a_2}x + (\frac{1}{b_1}-\frac{1}{b_2})y^2 = \frac{1}{a_2} + \ln \frac{a_2b_2}{a_1b_1} \end{equation}
\noindent We now see that the shape of the border will be one of the following six cases: (i) a straight line; (ii) a circle; (iii) an ellipse; (iv) a parabola; (v) a hyperbola; (vi) two straight lines, depending on how the points are distributed, that is, depending on $a_1$, $b_1$, $a_2$ and $b_2$ in the situation above.
\subsubsection{Examples}
Let's try some calculations under the above assumptions, that is,
\begin{eqnarray} \nonumber (1) \hspace{.2in}\Sigma_1 = \left( \begin{array}{cc} 0.10 & 0 \\ 0 & 0.10 \end{array} \right), \hspace{.5in}\Sigma_2 = \left( \begin{array}{cc} 0.10 & 0 \\ 0 & 0.10 \end{array} \right) \end{eqnarray}
\begin{eqnarray} \nonumber (2) \hspace{.2in}\Sigma_1 = \left( \begin{array}{cc} 0.10 & 0 \\ 0 & 0.10 \end{array} \right), \hspace{.5in}\Sigma_2 = \left( \begin{array}{cc} 0.20 & 0 \\ 0 & 0.20 \end{array} \right) \end{eqnarray}
\begin{eqnarray} \nonumber (3) \hspace{.2in}\Sigma_1 = \left( \begin{array}{cc} 0.10 & 0 \\ 0 & 0.15 \end{array} \right), \hspace{.5in}\Sigma_2 = \left( \begin{array}{cc} 0.20 & 0 \\ 0 & 0.25 \end{array} \right) \end{eqnarray}
\begin{eqnarray} \nonumber (4) \hspace{.2in}\Sigma_1 = \left( \begin{array}{cc} 0.10 & 0 \\ 0 & 0.15 \end{array} \right), \hspace{.5in}\Sigma_2 = \left( \begin{array}{cc} 0.15 & 0 \\ 0 & 0.10 \end{array} \right) \end{eqnarray}
\begin{eqnarray} \nonumber (5) \hspace{.2in}\Sigma_1 = \left( \begin{array}{cc} 0.10 & 0 \\ 0 & 0.20 \end{array} \right), \hspace{.5in}\Sigma_2 = \left( \begin{array}{cc} 0.10 & 0 \\ 0 & 0.10 \end{array} \right) \end{eqnarray}
\noindent The example below is somewhat tricky. I wanted an example in which the right-hand side of equation (\ref{eq:two-d-gaussian-final}) becomes zero and the left-hand side is a product of two first-order expressions in $x$ and $y$. As you might know, this is the case where the border is made up of two straight lines.
\begin{eqnarray} \nonumber (6) \hspace{.2in}\Sigma_1 = \left( \begin{array}{cc} 2e & 0 \\ 0 & 0.5 \end{array} \right), \hspace{.5in}\Sigma_2 = \left( \begin{array}{cc} 1 & 0 \\ 0 & 1 \end{array} \right) \end{eqnarray}
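\noindent Before listing the resulting border equations, here is a small numerical sanity check (a sketch assuming Python with NumPy is available; it is not part of the derivation) that evaluates $g_1 - g_2$ for setting (2) and locates where the border crosses the $x$-axis:
\begin{verbatim}
import numpy as np

def g(x, y, mux, a, b):
    """Discriminant for a diagonal covariance diag(a, b), mean (mux, 0)."""
    return -0.5 * ((x - mux) ** 2 / a + y ** 2 / b) - 0.5 * np.log(a * b)

# Setting (2): Sigma_1 = 0.1 I around (0,0), Sigma_2 = 0.2 I around (1,0).
a1 = b1 = 0.10
a2 = b2 = 0.20

xs = np.linspace(-3.0, 3.0, 601)
diff = g(xs, 0.0, 0.0, a1, b1) - g(xs, 0.0, 1.0, a2, b2)

# Sign changes of g1 - g2 along y = 0 approximate the border crossings.
crossings = xs[np.where(np.diff(np.sign(diff)) != 0)]
print("border crosses the x-axis near:", crossings)
\end{verbatim}
\noindent The result can be compared with the closed-form border listed below.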
My quick calculations tentatively give the following; see also the figure below.
\begin{itemize} \setlength{\itemsep}{0pt}
\item[(1)] $2x = 1 $
\item[(2)] $5 (x + 1)^2 + 5 y^2 = 10 + \ln4 $
\item[(3)] $5(x + 1)^2 + (8/3) y^2 = 10 + \ln (10/3)$
\item[(4)] $5(x + 1)^2 - (10/3) y^2 = 10 $
\item[(5)] $20x - 5 y^2 = 10 - \ln 2 $
\item[(6)] $(1 - 1/2e) x^2 - x - y^2 = 0 $
\end{itemize}
\begin{figure}[bhtp]
\begin{center}
\includegraphics[width=5cm,height=5cm]{two-d-example-1.eps}
\includegraphics[width=5cm,height=5cm]{two-d-example-2.eps}
\includegraphics[width=5cm,height=5cm]{two-d-example-3.eps}
\includegraphics[width=5cm,height=5cm]{two-d-example-6.eps}
\includegraphics[width=5cm,height=5cm]{two-d-example-5.eps}
\includegraphics[width=5cm,height=5cm]{two-d-example-4.eps}
\end{center}
\caption{A cloud of 100 points drawn from each of the two classes, and the border between the two classes calculated under the six different conditions. (The results for (4), (5) and (6) are still fishy and under another trial.)}
\label{fig:2d-ex}
\end{figure}
\noindent The final example in this subsection is a more general 2-dimensional case, (artificially) devised so that the calculations do not become too complicated. We now assume $\boldsymbol{\mu}_1 = (0,0)$ and $\boldsymbol{\mu}_2 = (1,1)$, and the two classes have the covariance matrices
\begin{eqnarray} \nonumber (7) \hspace{.2in}\Sigma_1 = \left( \begin{array}{cc} 2.0 & 0.5 \\ 0.5 & 2.0 \end{array} \right), \hspace{.5in}\Sigma_2 = \left( \begin{array}{cc} 5.0 & 4.0 \\ 4.0 & 5.0 \end{array} \right) \end{eqnarray}
\clearpage
\subsection{3-D Gaussian case}\label{subsec}
Here we study only one example. We assume two classes with $P(\omega_1) = P(\omega_2) = 1/2$. In each class the patterns are distributed according to a Gaussian p.d.f., and both classes have the same covariance matrix
\begin{eqnarray} \nonumber \Sigma = \left( \begin{array}{ccc} 0.3 & 0.1 & 0.1 \\ 0.1 & 0.3 & -0.1 \\ 0.1 & -0.1 & 0.3 \end{array} \right) \end{eqnarray}
\noindent and the means of the distributions are $(0, 0, 0)^T$ and $(1, 1, 1)^T$. Let us take a look at what our discriminant function
\begin{eqnarray}\label{eq:2-d-gaussian} g_i(\mbox{\bf x}) &=& - \frac{1}{2} (\mbox{\bf x}- \boldsymbol{\mu}_i)^T \Sigma_i^{-1}(\mbox{\bf x}- \boldsymbol{\mu}_i) \end{eqnarray}
\noindent leads to. \\
\noindent Since (see APPENDIX III) the inverse of the covariance matrix is
\begin{eqnarray} \nonumber \Sigma^{-1} = \frac{1}{2} \left( \begin{array}{ccc} 10 & -5 & -5 \\ -5 & 10 & 5 \\ -5 & 5 & 10 \end{array} \right) \end{eqnarray}
\noindent our discriminant equation $g_1(\mbox{\bf x}) = g_2(\mbox{\bf x})$ becomes (dropping the factors $-\frac{1}{2}$ and $\frac{1}{2}$ common to both sides)
\begin{eqnarray} \nonumber (x_1 \hspace{.1in} x_2 \hspace{.1in} x_3) \left( \begin{array}{ccc} 10 & -5 & -5 \\ -5 & 10 & 5 \\ -5 & 5 & 10 \end{array} \right) \left( \begin{array}{c} x_1 \\ x_2 \\ x_3 \end{array} \right) = \hspace{1.5in} \end{eqnarray}
\begin{eqnarray} \nonumber \hspace*{1.5in} ((x_1-1) \hspace{.1in} (x_2-1) \hspace{.1in} (x_3-1)) \left( \begin{array}{ccc} 10 & -5 & -5 \\ -5 & 10 & 5 \\ -5 & 5 & 10 \end{array} \right) \left( \begin{array}{c} x_1-1 \\ x_2-1 \\ x_3-1 \end{array} \right) \end{eqnarray}
\noindent Further calculation leads to
\begin{eqnarray} \nonumber ((10x_1-5x_2-5x_3) \hspace{.1in} (-5x_1+10x_2+5x_3) \hspace{.1in} (-5x_1+5x_2+10x_3)) \left( \begin{array}{c} x_1 \\ x_2 \\ x_3 \end{array} \right)= \end{eqnarray}
\begin{eqnarray} \nonumber ((10x_1-5x_2-5x_3) \hspace{.1in} (-5x_1+10x_2+5x_3-10) \hspace{.1in} (-5x_1+5x_2+10x_3-10)) \left( \begin{array}{c} x_1-1 \\ x_2-1 \\ x_3-1 \end{array} \right) \end{eqnarray}
\noindent All the second-order terms cancel and we obtain
\begin{displaymath} x_2 + x_3 = 1. \end{displaymath}
We now see that the border which discriminates the two classes $\omega_1$ and $\omega_2$ is this plane (note that it passes through the midpoint $(\frac{1}{2},\frac{1}{2},\frac{1}{2})$ of the two means).
\subsection{A Higher-dimensional Gaussian case}
The equation
\begin{eqnarray} \nonumber g_i(\mbox{\bf x}) &=& \ln p(\mbox{\bf x} \vert \omega_i) + \ln P(\omega_i) \end{eqnarray}
still holds, of course. Now let's recall that the Gaussian pdf is
\begin{equation} p(\mbox{\bf x} \vert \omega) = \frac{1}{{(2\pi)}^{d/2} {\vert {\Sigma} \vert}^{1/2}} \exp \{-\frac{1}{2}(\mbox{\bf x}-\boldsymbol{\mu})^t \Sigma^{-1}(\mbox{\bf x}-\boldsymbol{\mu})\} \end{equation}
\noindent and as such
\begin{equation} \label{eq:discriminant-fn} g_i(\mbox{\bf x}) = -\frac{1}{2}(\mbox{\bf x}- \boldsymbol{\mu}_i)^t \Sigma_i^{-1}(\mbox{\bf x}-\boldsymbol{\mu}_i) - \frac{d}{2}\ln(2\pi)-\frac{1}{2}\ln \vert \Sigma_i \vert + \ln P(\omega_i) \end{equation}
\noindent We now take a look at cases which simplify the situation more or less.\\ \bigskip
\subsubsection*{$\bullet$ When $\Sigma_i = \sigma^2 I$}
\noindent In this case it is easy to guess that the samples of each class fall within hyperspheres of equal size. Note, first of all, that $\vert \Sigma_i \vert = \sigma^{2d}$ and $\Sigma_i^{-1} = (1/\sigma^2)I$.
So $g_i(\mbox{\bf x})$ here reduces to
\begin{equation} g_i(\mbox{\bf x}) = - \frac{\| \mbox{\bf x} - \boldsymbol{\mu}_i \| ^2}{2\sigma^2} + \ln P(\omega_i) \end{equation}
or, equivalently,
\begin{equation} \label{eq:discrimitnant} g_i(\mbox{\bf x}) = - \frac{1}{2\sigma^2} ({\mbox{\bf x}}^t {\mbox{\bf x}} - 2\boldsymbol{\mu}_i^t {\mbox{\bf x}} + \boldsymbol{\mu}_i^t\boldsymbol{\mu}_i) + \ln P(\omega_i) \end{equation}
Neglecting the term ${\mbox{\bf x}}^t {\mbox{\bf x}}$, which is common to all classes and therefore does not affect the relation $g_i(\mbox{\bf x})-g_j(\mbox{\bf x})= 0$, our $g_i(\mbox{\bf x})$ is now
\begin{equation} g_i(\mbox{\bf x}) = \frac{1}{\sigma^2} \boldsymbol{\mu}_i^t {\mbox{\bf x}} - \frac{1}{2\sigma^2} \boldsymbol{\mu}_i^t\boldsymbol{\mu}_i + \ln P(\omega_i) \end{equation}
Then $g_i(\mbox{\bf x})-g_j(\mbox{\bf x})=0$ leads to
\begin{equation} \frac{1}{\sigma^2} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^t {\mbox{\bf x}} - \frac{1}{2\sigma^2} (\| \boldsymbol{\mu}_i \|^2 - \| \boldsymbol{\mu}_j \|^2) + \ln \frac{P(\omega_i)}{P(\omega_j)} = 0 \end{equation}
\noindent In this case (when the priors $P(\omega_i)$ are all equal) our classification rule will be\\ \bigskip
\begin{formula}[Minimum Distance Classification] Measure the Euclidean distance $\| \mbox{\bf x} - \boldsymbol{\mu}_i \|$ for every $i$, then classify $\mbox{\bf x}$ to the class whose mean is nearest to $\mbox{\bf x}$. \end{formula}
\noindent If we carefully rearrange Eq. (\ref{eq:discrimitnant}) we will obtain
\begin{equation}\label{eq:100} \mbox{\bf w}\cdot(\mbox{\bf x} - \mbox{\bf x}_0) = 0 \end{equation}
\bigskip
\begin{excersize} Derive Eq. (\ref{eq:100}), specifying $\mbox{\bf w}$ and ${\mbox{\bf x}}_0$. \end{excersize}
\bigskip
\noindent Eq. (\ref{eq:100}) can be interpreted as
\begin{center}{\it ``a hyperplane through ${\mbox{\bf x}}_0$ perpendicular to $\mbox{\bf w}$.'' }\end{center}
\subsubsection*{$\bullet$ When all $\Sigma_i = \Sigma$}
\noindent This condition implies that the patterns in each class are distributed in hyper-ellipsoids of the same shape and orientation. Now our discriminant function is
\begin{displaymath} g_i(\mbox{\bf x}) = -\frac{1}{2}(\mbox{\bf x}-\boldsymbol{\mu}_i)^t \Sigma^{-1}(\mbox{\bf x}-\boldsymbol{\mu}_i) + \ln P(\omega_i) \end{displaymath}
\noindent We again obtain
\begin{displaymath} \mbox{\bf w}\cdot(\mbox{\bf x} - \mbox{\bf x}_0) = 0 \end{displaymath}
\noindent where $\mbox{\bf w}$ and $\mbox{\bf x}_0$ are
\begin{equation} \mbox{\bf w} = \Sigma^{-1}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j) \end{equation}
and
\begin{equation} \mbox{\bf x}_0 = \frac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\ln P(\omega_i)/P(\omega_j)} {(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^t \Sigma^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j) \end{equation}
\noindent Notice that $\mbox{\bf w} = \Sigma^{-1}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$ is in general no longer parallel to $\boldsymbol{\mu}_i - \boldsymbol{\mu}_j$, so the separating hyperplane is not necessarily perpendicular to the line joining the two means. \bigskip
\begin{excersize} Derive $\mbox{\bf w}$ and ${\mbox{\bf x}}_0$ above. \end{excersize}
\bigskip
\noindent So (again assuming equal priors) we modify the above rule to \bigskip
\begin{formula}[Classification by Mahalanobis distance] Assign {\bf x} to the class $\omega_i$ for which the Mahalanobis distance from $\boldsymbol{\mu}_i$ is minimum over all $i$. \end{formula}
\bigskip
\noindent Yes! The (squared) {\it Mahalanobis distance} between {\bf a} and {\bf b} is defined as
\begin{equation} (\mbox{\bf a} - \mbox{\bf b})^t \Sigma^{-1} (\mbox{\bf a}- \mbox{\bf b}) \end{equation}
\noindent A final example for this case is a more general 2-dimensional one, (artificially) devised so that the calculations do not become too complicated.
We now assume $\boldsymbol{\mu}_1 = (0,0)$ and $\boldsymbol{\mu}_2 = (1,0)$, and both classes share the same $\Sigma$:
\begin{eqnarray} \nonumber (7) \hspace{.2in}\Sigma_1 = \left( \begin{array}{cc} 1.1 & 0.3 \\ 0.3 & 1.9 \end{array} \right), \hspace{.5in}\Sigma_2 = \left( \begin{array}{cc} 1.1 & 0.3 \\ 0.3 & 1.9 \end{array} \right) \end{eqnarray}
\bigskip
\subsubsection*{$\bullet$ When all $\Sigma_i$'s are arbitrary}
\noindent When there is no such restriction to simplify the situation, the discriminant function is
\begin{displaymath} g_i(\mbox{\bf x}) = -\frac{1}{2}(\mbox{\bf x}-\boldsymbol{\mu}_i)^t \Sigma_i^{-1}(\mbox{\bf x}-\boldsymbol{\mu}_i) -\frac{d}{2} \ln 2\pi - \frac{1}{2} \ln \vert \Sigma_i \vert + \ln P(\omega_i) \end{displaymath}
The only term we can neglect now is $(d/2) \ln 2\pi$. We apply the identity (valid for symmetric $A$)
\begin{displaymath} (\mbox{\bf x} - \mbox{\bf y})^t A (\mbox{\bf x} - \mbox{\bf y}) = \mbox{\bf x}^t A \mbox{\bf x} - 2 (A \mbox{\bf y})^t\mbox{\bf x} + \mbox{\bf y}^t A \mbox{\bf y}. \end{displaymath}
Then we get the following rewritten discriminant function
\begin{equation} g_i(\mbox{\bf x}) = \mbox{\bf x}^t W_i \mbox{\bf x} + \mbox{\bf w}_i^t \mbox{\bf x} + w_{i0} \end{equation}
where
\begin{displaymath} W_i = - \frac{1}{2} \Sigma_i^{-1} \end{displaymath}
\begin{displaymath} \mbox{\bf w}_i = \Sigma_i^{-1}\boldsymbol{\mu}_i \end{displaymath}
\begin{displaymath} w_{i0} = -\frac{1}{2} \boldsymbol{\mu}_i^t \Sigma_i^{-1} \boldsymbol{\mu}_i - \frac{1}{2} \ln \vert \Sigma_i \vert + \ln P(\omega_i). \end{displaymath}
\bigskip
\noindent Hence, $g_i(\mbox{\bf x}) - g_j(\mbox{\bf x}) = 0$ is a {\it hyperquadratic} equation. In special cases it can even factor into a product of linear forms,
\begin{displaymath} (a_1 x_1 + a_2 x_2 + \cdots + a_n x_n)(b_1 x_1 + b_2 x_2 + \cdots + b_n x_n) = const. \end{displaymath}
\bigskip
\noindent In general, the border is one of: (i) a hyperplane; (ii) a pair of hyperplanes; (iii) a hypersphere; (iv) a hyper-ellipsoid; (v) a hyper-paraboloid; (vi) a hyper-hyperboloid.
\subsubsection{2-D cases revisited}
\clearpage
\section{Parameter Estimation}
Our goal is to know $p(\omega_i \vert \mbox{\bf x})$.
\noindent Assume now that, for some reason, we believe we know the form of the density function; what we don't know are the parameters of the density function.\\
\noindent To get a bird's-eye view, here are a few one-dimensional examples, besides the Gaussian, of density functions that are specified by a parameter $\theta$. (A short code sketch evaluating some of them follows the list.)\\
\begin{itemize} \setlength{\itemsep}{0pt}
\item Uniform
\begin{itemize} \item $p(x) = 1/\theta $ \ \ if $0 < x \le \theta$, otherwise 0 \end{itemize}
\item Exponential
\begin{itemize} \item $f(x) = \theta \exp(-\theta x)$ \ \ if $x>0$, otherwise 0 \end{itemize}
\item Rayleigh (see the figure below)
\begin{itemize} \item $f(x) = 2\theta x \exp(-\theta x^2)$ \ \ if $x>0$, otherwise 0 \end{itemize}
\item Poisson
\begin{itemize} \item $f(x) = (\theta^x/x!)\exp(- \theta)$ \ \ for $x=0, 1, 2, \cdots$ \end{itemize}
\item Bernoulli
\begin{itemize} \item $f(x) = \theta^x (1 - \theta)^{1-x}$ \ \ for $x=0,1$ \end{itemize}
\item Binomial
\begin{itemize} \item $f(x) = \frac{m!}{x!(m-x)!} \theta^x (1-\theta)^{m-x}$ \ \ for $x = 0, 1, \cdots, m$ \end{itemize}
\end{itemize}
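\noindent As a companion to the list above, here is a small sketch (assuming Python with NumPy; the particular $\theta$ values are arbitrary illustrations) that evaluates two of these densities, including the Rayleigh density used repeatedly below:
\begin{verbatim}
import numpy as np

def rayleigh_pdf(x, theta):
    """f(x) = 2*theta*x*exp(-theta*x^2) for x > 0, else 0."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, 2.0 * theta * x * np.exp(-theta * x * x), 0.0)

def exponential_pdf(x, theta):
    """f(x) = theta*exp(-theta*x) for x > 0, else 0."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, theta * np.exp(-theta * x), 0.0)

# Evaluate the Rayleigh density on a grid for a few arbitrary theta values,
# in the spirit of the curves shown in the figure below.
xs = np.linspace(0.0, 10.0, 101)
for theta in (0.05, 0.1, 0.2, 0.5):
    print(theta, rayleigh_pdf(xs, theta).max().round(3))
\end{verbatim}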
\begin{figure}[bhtp] \begin{center}
\includegraphics[width=5cm,height=5cm]{rayleigh.eps}
\end{center}
\caption{The Rayleigh density for four different values of $\theta$.}
\end{figure}
\noindent All we don't know are the parameters that specify the function. We may, however, have prior information $p(\theta)$, and the training samples give us $p(\theta \vert D)$, which we expect to have a sharp peak at the true $\theta$. \\
\subsection{Maximum Likelihood Estimation}
Before entering this topic, try the following case, in which we have two classes, each with only 4 training samples. \bigskip
\begin{example} Assume a 2-dimensional feature space, and suppose we have 4 samples from each of two classes. The patterns we have are $(8,3), (4,3), (6,2), (6,4)$ from $\omega_1$ and $(0,3), (-2,1), (-4,3), (-2,5)$ from $\omega_2$. Try to guess the border between the two classes, and then classify a new point, $(5,3)$ for example. \end{example} \bigskip
\noindent We now simplify the situation even further.\\ \bigskip
\begin{example} Our data are now $x_1 = 3$, $x_2 = 8$, $x_3 = 2$ and $x_4 = 5$. Guess what distribution these 4 data follow. \end{example} \bigskip
\noindent These might come from a Gaussian distribution, but a brief look suggests something more like a Rayleigh distribution. \bigskip
\begin{figure}[bhtp] \begin{center}
\includegraphics[width=5cm,height=5cm]{rayleigh-2.eps}
\includegraphics[width=5cm,height=5cm]{rayleigh-3.eps}
\end{center}
\caption{A cloud of 200 points, each set drawn from a different $N[\mu, \Sigma]$.}
\end{figure}
\bigskip
\begin{excersize} (1) Assume now a Gaussian with $\sigma = 1$ and unknown $\mu$. Estimate $\mu$ from the same 4 data, i.e. $x_1 = 3$, $x_2 = 8$, $x_3 = 2$ and $x_4 = 5$, in the same way as above. (2) Next, create 10, 20, and 100 data randomly from $N(1, 3)$, and plot $p(D \vert \mu)$ as a function of $\mu$ in each of the three cases. \end{excersize}
\begin{formula}[An assumption of independence of the data] Take the $\theta$ which maximizes
\begin{equation} P(D \vert \theta) = \prod_{k=1}^{n} p(x_k \vert \theta) \end{equation}
\end{formula}
\noindent In short, choose the $\theta$ that maximizes the likelihood of the whole data set.\\
\noindent Now let's carry out this calculation for our previous example of the four one-dimensional data $D = \{2, 3, 5, 8\}$, changing $\theta$ from 0.00 to 1.00 with an interval of 0.01. The results are shown in the left-most graph of Fig.~\ref{fig:ML}. It is also interesting to see what happens if we have more data: the same procedure is carried out with 10, 20, and 100 data points, and the results are shown in the same figure.
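\noindent A minimal sketch of this grid computation (assuming Python with NumPy; the Rayleigh form $p(x \vert \theta) = 2\theta x e^{-\theta x^2}$ is the one from the list of densities above):
\begin{verbatim}
import numpy as np

def rayleigh_likelihood(data, theta):
    """P(D | theta) = prod_k 2*theta*x_k*exp(-theta*x_k^2)."""
    data = np.asarray(data, dtype=float)
    return np.prod(2.0 * theta * data * np.exp(-theta * data ** 2))

D = [2.0, 3.0, 5.0, 8.0]                      # the four training samples
thetas = np.arange(0.0, 1.01, 0.01)           # grid of candidate thetas
likelihoods = np.array([rayleigh_likelihood(D, t) for t in thetas])

best = thetas[np.argmax(likelihoods)]
print("maximum-likelihood theta on the grid:", best)
# For the Rayleigh model the maximum is also available in closed form,
# theta_hat = n / sum(x_k^2), which the grid search should approximate.
print("closed form:", len(D) / sum(x * x for x in D))
\end{verbatim}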
The important thing is:
\begin{center}{\it ``The more data we have, the sharper the graph becomes.'' } \end{center}
\begin{figure}[bhtp] \begin{center}
\includegraphics[width=3cm,height=3cm]{ml-rayleigh.eps}
\includegraphics[width=3cm,height=3cm]{ml-rayleigh-10.eps}
\includegraphics[width=3cm,height=3cm]{ml-rayleigh-20.eps}
\includegraphics[width=3cm,height=3cm]{ml-rayleigh-100.eps}
\end{center}
\caption{The likelihood $P(D \vert \theta)$ for the Rayleigh model as a function of $\theta$, computed from 4, 10, 20, and 100 data points.}
\label{fig:ML}
\end{figure}
\begin{figure}[bhtp] \begin{center}
\includegraphics[width=10cm,height=5cm]{gaussian.eps}
\end{center}
\caption{A cloud of 200 points, each set drawn from a different $N[\mu, \Sigma]$.}
\end{figure}
\begin{figure}[bhtp] \begin{center}
\includegraphics[width=10cm,height=5cm]{uniform.eps}
\end{center}
\caption{A cloud of 200 points, each set drawn from a different $N[\mu, \Sigma]$.}
\end{figure}
\subsection{Bayesian parameter estimation}
Here we view the parameter $\boldsymbol{\theta}$ as a random variable with a prior distribution $p(\boldsymbol{\theta})$, and we view $p(\boldsymbol{\theta} \vert D)$ as a function of $\boldsymbol{\theta}$. As before, we assume
\begin{equation} P(D \vert \theta) = \prod_{k=1}^{n} p(x_k \vert \theta) \end{equation}
\noindent where we denote the training data
\begin{displaymath} D_n = \{x_1, x_2, \cdots, x_n\} \end{displaymath}
\noindent which implies
\begin{displaymath} P(D_n \vert \theta) = P(x_n \vert \theta) P(D_{n-1} \vert \theta) \end{displaymath}
\noindent Therefore the Bayesian formula
\begin{equation} p(\theta \vert D) \propto P(D \vert \theta) P(\theta) \end{equation}
leads us to the recursion
\begin{eqnarray} \nonumber P(\theta \vert D_n) &\propto& P(D_n \vert \theta)P(\theta) \\ \nonumber &=& P(x_n \vert \theta) P(D_{n-1} \vert \theta)P(\theta) \\ \nonumber &\propto& P(x_n \vert \theta) P(\theta \vert D_{n-1}) \end{eqnarray}
\clearpage
\section{Non-parametric density estimation}
Our assumptions so far have been:
\begin{itemize}
\item [(1)] the density function $p(\mbox{\bf x} \vert \omega)$ is of a known form;
\item [(2)] a high-dimensional density function can be represented as a product of one-dimensional functions.
\end{itemize}
\noindent Now let's make a more realistic assumption: ``we don't know the distribution of the samples, i.e., we don't know the form of the density function.'' Let's estimate $p(\mbox{\bf x} \vert \omega)$ from our sample patterns, or estimate $P(\omega \vert \mbox{\bf x})$ directly from them. The typical example of the latter is the {\it nearest-neighbor rule}.
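\noindent As a preview of such a direct approach, here is a minimal sketch of a one-nearest-neighbor classifier (assuming Python with NumPy, and reusing the eight labelled samples of Example 1):
\begin{verbatim}
import numpy as np

# The labelled training samples of Example 1 (feature vectors and labels).
X = np.array([[8, 3], [4, 3], [6, 2], [6, 4],      # class 1
              [0, 3], [-2, 1], [-4, 3], [-2, 5]])  # class 2
y = np.array([1, 1, 1, 1, 2, 2, 2, 2])

def nearest_neighbor(x):
    """Assign x the label of the closest training sample (1-NN rule)."""
    distances = np.linalg.norm(X - np.asarray(x, dtype=float), axis=1)
    return y[np.argmin(distances)]

print(nearest_neighbor([5, 3]))   # the new point from Example 1
\end{verbatim}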
\section{Unknown Density Function Estimation}
One of the most elementary formulas in probability theory tells us that the probability that $\mbox{\bf x}$ falls in a region $R$ is
\begin{equation}\label{eq:elementary-prob} P= \int_{R} p(\mbox{\bf x}) d \mbox{\bf x} \end{equation}
Recall that, in the 1-dimensional case, the probability that $x$ falls in the interval $[x_1, x_2]$ is
\begin{displaymath} P= \int_{x_1}^{x_2} p(x) dx \end{displaymath}
\noindent The probability that exactly $k$ out of the $n$ samples $\{x_1, x_2, \cdots, x_n\}$ fall in the region $R$ is
\begin{equation}\label{eq:binominal} \binom{n}{k} P^k(1-P)^{n-k} \end{equation}
\noindent which peaks sharply around $k=nP$, so we can assume that
\begin{equation} k \simeq nP \end{equation}
holds. On the other hand, Eq. (\ref{eq:elementary-prob}) can be approximated as
\begin{equation} P \simeq p(\mbox{\bf x}) V \end{equation}
where $V$ is the volume of the region $R$, on condition that $V$ is small enough. In the 1-dimensional case this is
\begin{displaymath} p(x)\cdot (x_2-x_1) \end{displaymath}
on condition that $x_2-x_1 \approx 0$. Hence we now have the relation
\begin{equation} p(\mbox{\bf x}) \simeq \frac{k/n}{V}. \end{equation}
\bigskip
\begin{excersize} Calculate Eq. (\ref{eq:binominal}) above for, say, $n=4, 20, 100$, and plot the probability as a function of $k/n$, assuming the true $P$ is, say, 0.6. Confirm that the larger the value of $n$, the sharper the peak. \end{excersize}
\noindent Now, in order to estimate the density $p(\mbox{\bf x})$ at {\bf x}, let's create a sequence of regions $R_1, R_2, R_3, \cdots$, where $R_n$ is the region used when the estimate is made with $n$ samples. Our fundamental equation becomes
\begin{equation} p_n(\mbox{\bf x}) = \frac{k_n/n}{V_n} \end{equation}
\noindent The conditions for this to converge are
\begin{equation} \lim_{n \to \infty} k_n = \infty \end{equation}
\begin{equation} \lim_{n \to \infty} V_n = 0 \end{equation}
\begin{equation} \lim_{n \to \infty} k_n/n = 0 \end{equation}
\noindent To realize these conditions we have basically two approaches. In one approach we shrink $V_n$ as $n$, the number of training samples, increases, for example according to $V_n = V_0/\sqrt{n}$. In the other approach we specify $k_n$ as a function of $n$, for example $k_n = k_0 \sqrt{n}$, and grow the region around $\mbox{\bf x}$ until it captures $k_n$ samples.
\noindent The former is called the {\it Parzen-window} approach and the latter the {\it $k_n$-nearest-neighbor} approach, both of which we are going to explore in the next two subsections.
\subsection{Parzen-Window classifier}
First of all, we define a new window function
\begin{eqnarray} \varphi (\mbox{\bf u}) = \left\{ \begin{array}{ll} 1 & \hspace{.2in}\mbox{if}\hspace{.2in} \vert u_j \vert \le \frac{1}{2} \hspace{.1in} \mbox{for every component } j\\ 0 & \hspace{.2in}\mbox{otherwise}\\ \end{array} \right. \end{eqnarray}
\noindent Then we get
\begin{eqnarray} \varphi \left(\frac{\mbox{\bf x} - \mbox{\bf x}_i}{h_n}\right) = \left\{ \begin{array}{ll} 1 & \hspace{.2in}\mbox{if}\hspace{.2in} \mbox{\bf x}_i \hspace{.1in} \mbox{falls within the hypercube}\\ 0 & \hspace{.2in}\mbox{otherwise}\\ \end{array} \right. \end{eqnarray}
where the hypercube is located with its center at {\bf x} and all of its edges of length $h_n$. Here we can write $k_n$ as
\begin{equation} k_n = \sum_{i=1}^n \varphi \left(\frac{\mbox{\bf x} - \mbox{\bf x}_i}{h_n}\right) \end{equation}
\noindent Note that at this moment $k_n$ is just a count of samples; to turn it into a density we have to normalize it. Recalling that our goal is to obtain the density $p(\mbox{\bf x})$ at {\bf x}, we are happy to obtain the following equation.
\begin{equation} p_n(\mbox{\bf x}) = \frac{1}{n} \sum_{i=1}^n {\frac{1}{V_n} \varphi \left(\frac{\mbox{\bf x} - \mbox{\bf x}_i}{h_n}\right)} \end{equation}
\bigskip
\begin{excersize} Assuming $p(x) = U(0,1)$, try to find what happens if we use $\pi_n(x)$ as the Parzen window function. Draw $p_n(x)$ in each case of $h_n=1, 1/4, 1/16$. \end{excersize}
\subsubsection{A Neural Network Implementation}
\begin{figure}[bhtp] \begin{center}
\includegraphics[width=8cm,height=5cm]{nn-for-bayes.eps}
\end{center}
\caption{A probabilistic-neural-network implementation of the Parzen-window approach.}
\end{figure}
\begin{figure}[bhtp] \begin{center}
\includegraphics[width=8cm,height=5cm]{nn-for-bayes-2.eps}
\end{center}
\caption{A probabilistic-neural-network implementation of the Parzen-window approach (continued).}
\end{figure}
\subsection{Nearest-Neighbor classifier}
\section{Bayesian Belief Network}
We now assume we have $N$ variables, as in our previous example of salmon:
\begin{center}{\it Salmon \{female, male\} \\ Season \{spring, summer, autumn, winter\}\\ Location \{sea, river, lake\} \\ Length \{short, medium, long\} }\end{center}
\noindent Or, in more general form,
\begin{displaymath} A(a_1, a_2, \cdots) \end{displaymath}
\begin{displaymath} B(b_1, b_2, \cdots) \end{displaymath}
\begin{displaymath} C(c_1, c_2, \cdots) \end{displaymath}
and we also know, for example,
\begin{displaymath} p(\mbox{\bf a} \vert \mbox{\bf b}) \end{displaymath}
\begin{displaymath} p(\mbox{\bf b} \vert \mbox{\bf c}) \end{displaymath}
\noindent etc. Then we can represent this by a kind of network like:
\bigskip
\begin{figure}[bhtp] \begin{center}
\includegraphics[width=5cm,height=5cm]{blief-net.eps}
\end{center} \end{figure}
\noindent We call this network a {\it Bayesian Belief Network}. The assumption here is that we know all ...
\noindent The aim is
\begin{itemize}
\item to classify a new data point {\bf x};
\item to know the probability of an unknown variable from the other, known variables.
\end{itemize}
\begin{figure}[bhtp] \begin{center}
\includegraphics[width=5cm,height=5cm]{blief-net-2.eps}
\end{center} \end{figure}
\noindent We now take $\mbox{\bf x} = (x_1, x_2, \cdots)$ as a {\it belief} about $x_1, x_2, \cdots$ under the {\it evidences} $\mbox{\bf e} = (e_1, e_2, \cdots)$, where some of the $e_i$ come from {\it parent} variables and the others from {\it children}.
Then the elementary calculation is
\begin{equation} p(\mbox{\bf x} \vert \mbox{\bf e}_{\mbox {\small all}}) \propto p(\mbox{\bf e}_{\mbox {\small children}} \vert \mbox{\bf x}) \cdot p(\mbox{\bf x} \vert \mbox{\bf e}_{\mbox {\small parents}}) \end{equation}
where
\begin{equation} p(\mbox{\bf e}_{\mbox {\small children}} \vert \mbox{\bf x}) = p(e_{c_1} \cdots e_{c_n} \vert \mbox{\bf x}) = p(e_{c_1} \vert \mbox{\bf x}) \cdots p(e_{c_n} \vert \mbox{\bf x}) = \prod_{i} p(e_{c_i} \vert \mbox{\bf x}) \end{equation}
and
\begin{equation} p(\mbox{\bf x} \vert \mbox{\bf e}_{\mbox {\small parents}}) = p(\mbox{\bf x} \vert e_{p_1}, e_{p_2}, \cdots) = \sum_{p_1, p_2, \cdots} p(\mbox{\bf x} \vert p_1, p_2, \cdots) p(p_1, p_2, \cdots \vert e_{p_1}, e_{p_2}, \cdots) \end{equation}
where
\begin{equation} p(p_1, p_2, \cdots \vert e_{p_1}, e_{p_2}, \cdots) = p(p_1 \vert e_{p_1}) \cdot p(p_2 \vert e_{p_2}) \cdots \end{equation}
For example,
\begin{displaymath} p(x_1) = p(x_1 \vert a_1 b_1)p(a_1)p(b_1) + \cdots \end{displaymath}
\noindent In our salmon example, assume we know all the conditional probabilities, such as
\begin{eqnarray} \nonumber \begin{array}{ccc} \hspace*{.1in} & \mbox{female} & \mbox{male} \\ \mbox{winter} & 0.9 & 0.1 \\ \mbox{spring} & 0.3 & 0.7 \\ \mbox{summer} & 0.4 & 0.6 \\ \mbox{autumn} & 0.8 & 0.2 \end{array} \end{eqnarray}
\begin{example} \hspace*{1in}\\
\begin{itemize} \setlength{\itemsep}{0pt}
\item[(1)] A new salmon was caught in {\it winter}, at the {\it lake}. The size is {\it long} and it is very {\it thin}. What is the probability that it is a female salmon?
\item[(2)] The new salmon we have is {\it short}, {\it thick}, and caught in a {\it river}. What is the most likely season in which it was caught?
\item[(3)] We have a {\it long}, {\it thin} salmon caught in {\it winter}. What is the probability that it is from the sea?
\item[(4)] It was caught between {\it autumn} and {\it winter} in a {\it lake}; we have no information about its length, but it is of {\it medium thickness}. Is the salmon male?
\item[(5)] If it is {\it short} and {\it thick}, in which season do you guess the salmon was caught?
\item[(6)] If it is {\it long} and {\it thick} and from a {\it lake}, which season is most likely?
\end{itemize}
\end{example}
\section{Hidden Markov Model}
\subsection{Markov Model}
We now assume that a pattern is a sequence of states $x(t)$, something like
\begin{displaymath} {x(1), x(2), x(3), \cdots, x(T)} \end{displaymath}
Here we assume time is discrete: $1,2,3, \cdots, T$. Let's denote a sequence of states of length $T$ by $\mbox{\bf x}^T$. Imagine
\begin{displaymath} \mbox{\bf x}^7 = {x_1, x_6, x_3, x_3, x_1, x_2, x_4} \end{displaymath}
An example of a state is the weather: think of $x_1$ as a fine day, $x_2$ as a cloudy day, $x_3$ as a rainy day, and so on. Assume we know the transition probability from any state to any state,
\begin{equation} P(x_j(t+1) \vert x_i(t)) = a_{ij}. \end{equation}
We call the full set of these $a_{ij}$ a {\it model}. When we denote this full set of $a_{ij}$ by $\boldsymbol{\theta}$, the probability of the above example sequence $\mbox{\bf x}^7$ under the model $\boldsymbol{\theta}$ is
\begin{equation} P(\mbox{\bf x}^7 \vert \boldsymbol{\theta}) = a_{16} \cdot a_{63} \cdot a_{33} \cdot a_{31} \cdot a_{12} \cdot a_{24}. \end{equation}
\noindent Our goal here is to find the model which best explains the training patterns of a known class; later, a test pattern is classified by the model that has the highest posterior probability $P(\omega \vert \mbox{\bf x})$.
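\noindent A minimal sketch of this sequence-probability calculation (assuming Python with NumPy; the transition matrix is randomly generated for illustration, not taken from any real data):
\begin{verbatim}
import numpy as np

# An arbitrary 6-state transition matrix: A[i, j] = P(x_j at t+1 | x_i at t),
# generated at random just for illustration; each row is normalized to 1.
rng = np.random.default_rng(0)
A = rng.random((6, 6))
A = A / A.sum(axis=1, keepdims=True)

def sequence_probability(states, A):
    """P(x^T | model): product of the transition probabilities a_{ij}
    along the sequence (the initial state is taken as given)."""
    p = 1.0
    for s, s_next in zip(states[:-1], states[1:]):
        p *= A[s, s_next]
    return p

# x^7 = {x1, x6, x3, x3, x1, x2, x4}, written with 0-based indices.
seq = [0, 5, 2, 2, 0, 1, 3]
print(sequence_probability(seq, A))   # = a16 * a63 * a33 * a31 * a12 * a24
\end{verbatim}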
\subsection{Hidden Markov Model}
Assume now that the state $x(t)$ is not observable, but emits a visible symbol $y(t)$. For example,
\begin{equation} {y_5, y_3, y_2, y_2, y_7, y_2, y_3} \end{equation}
Assume we also know
\begin{equation} P(y_k(t) \vert x_j(t)) = b_{jk} \end{equation}
where $\sum_j a_{ij} = 1$ for all $i$ and $\sum_k b_{jk} = 1$ for all $j$. This is called a {\it Hidden Markov Model} because the sequence of states $x(t)$ is hidden from the observer.\\
\noindent Then we have three different problems.
\begin{itemize}
\item Evaluation problem
\begin{itemize} \item Evaluation of the probability of a particular sequence of visible symbols. \end{itemize}
\item Decoding problem
\begin{itemize} \item Assuming we know all of the $a_{ij}$ and $b_{jk}$, determination of the most likely sequence of hidden states for a given observed visible sequence. \end{itemize}
\item Learning problem
\begin{itemize} \item Estimation of the $a_{ij}$ and $b_{jk}$ from a given set of training observations of visible sequences. \end{itemize}
\end{itemize}
\section{Fuzzy Classification}
Degree of belongingness $\rightarrow$ degree of membership, ranging from 0 to 1.\\
\noindent Examples of membership functions:
\begin{figure}[bhtp] \begin{center} \end{center}
\caption{Trapezoid on a corner, trapezoid, triangle, singleton, triangle on a corner.}
\end{figure}
\noindent Do you like cold beer?
\begin{figure}[bhtp] \begin{center} \end{center}
\caption{Very cold, cold, not-so-cold, warm.}
\end{figure}
\noindent Yet another form of membership function is the Cauchy (bell-like) function
\begin{equation} \mu(x) = \frac {1}{1 + (x-12)^2} \end{equation}
which represents ``all real numbers close to 12''.
\begin{figure}[bhtp] \begin{center} \end{center}
\caption{All real numbers close to 12.}
\end{figure}
A Gaussian membership function is also often used.
\subsection{Basic Arithmetic}
The basic operations are the union $A \cup B$, the intersection $A \cap B$, and the complement $\bar{A}$.
\subsubsection{When two membership functions have the same domain}
The common domain is sometimes called the {\it universe of discourse}. As an example (compare with traditional set theory), consider the fuzzy sets ``close to 5'' and ``close to 7'' and form
\begin{itemize} \setlength{\itemsep}{0pt}
\item[1)] close to 5 {\bf and} close to 7;
\item[2)] close to 5 {\bf or} close to 7;
\item[3)] {\bf not} close to 5.
\end{itemize}
\subsection{When $x$ and $y$ are defined on different domains}
Consider ``young'' and ``tall'', given a membership function for each (say over 15--35 years and 170--190 cm). How can we represent the combined relation?
\begin{itemize} \setlength{\itemsep}{0pt}
\item[1)] as a 3-D surface;
\item[2)] as a gray-scale image;
\item[3)] by discretizing both domains, which yields a matrix (e.g. rows for 15, 20, 25, 30, 35 years and columns for 175, 180, 185, 190 cm).
\end{itemize}
\noindent Common definitions of ``$A$ and $B$'' and ``$A$ or $B$'':
\begin{center}
\begin{tabular}{lcc}
 & $A$ and $B$ & $A$ or $B$ \\ \hline
Standard & $\min(\mu_A, \mu_B)$ & $\max(\mu_A, \mu_B)$ \\
Algebraic & $\mu_A \mu_B$ & $\mu_A + \mu_B - \mu_A \mu_B$ \\
Lukasiewicz & $\max(0, \mu_A + \mu_B - 1)$ & $\min(1, \mu_A + \mu_B)$ \\
Einstein & $\mu_A \mu_B/(2-(\mu_A+\mu_B)+\mu_A \mu_B)$ & $(\mu_A + \mu_B)/(1+ \mu_A \mu_B)$
\end{tabular}
\end{center}
\begin{excersize} Given $\mu_A$ and $\mu_B$ as ..., compute the combinations above. \end{excersize}
\subsection{IF-THEN rules}
Mamdani's definition: $\mu_{A \rightarrow B} = \min(\mu_A, \mu_B)$. Larsen's definition: $\mu_{A \rightarrow B} = \mu_A \mu_B$. This is important because the goal of fuzzy inference, in short, is to obtain $y = f(x)$ from a set of fuzzy rules.
\subsection{Let's visualize IF-THEN rules}
\begin{itemize} \setlength{\itemsep}{0pt}
\item[1)] The simplest case: a single rule such as ``if $x$ = low then $y$ = high'', given membership functions for ``low'' (say over 10--40) and ``high'' (say over 300--400). The rule is represented by $\mu_R(x, y) = \min(\mu_{\rm low}(x), \mu_{\rm high}(y))$.
\item[2)] If the precondition is composite, like ``IF $x_1$ = low AND $x_2$ = medium THEN $y$ = high'', then $\mu_R(x_1, x_2, y) = \min(\mu_{\rm low}(x_1), \mu_{\rm medium}(x_2), \mu_{\rm high}(y))$ is defined on the 3-D space $(x_1, x_2, y)$ and hence cannot be drawn directly. Only when we give $x_1$ and $x_2$ specific values can $\mu(x_1, x_2, y)$ be plotted against $y$ on a one-dimensional axis.
\item[3)] What about two rules, like R1: ``if $x$ = low then $y$ = high'' and R2: ``if $x$ = medium then $y$ = medium''? Again we plot only after giving $x$ a specific value. Then $\mu_{R1}(y)$ and $\mu_{R2}(y)$ can be seen as before, and this time we further take the MAX of the two, since the rule base reads ``R1 OR R2''. Now we can plot the aggregated $\mu(y)$ on the $y$ axis.
\end{itemize}
\noindent Why does this aggregated $\mu(y)$ make us happy? Because it can be {\it defuzzified}, for example by taking its center of gravity, which gives a crisp output $y$.\\
\noindent An example in a more general setting: let {\it distance} take the values small, $\ldots$, large; {\it my\_speed} the values very low, $\ldots$, very high; and {\it brake} the values weak, $\ldots$, strong, with rules of the form R1: if $x_1 = \ldots$ AND $x_2 = \ldots$ THEN $y = \ldots$, and similarly R2, R3, and so on. Then, if the distance is $x_1 = 30$ m and my speed is 50 km/h, how strong should the braking $y$ be? (A small sketch of this min/max/defuzzify procedure follows.)
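\noindent Here is a minimal sketch of that procedure (assuming Python with NumPy; the shoulder-shaped membership functions, their ranges, and the two rules are made-up illustrations, not values from the lecture):
\begin{verbatim}
import numpy as np

def left_shoulder(x, a, b):
    """Membership 1 below a, falling linearly to 0 at b."""
    x = np.asarray(x, dtype=float)
    return np.clip((b - x) / (b - a), 0.0, 1.0)

def right_shoulder(x, a, b):
    """Membership 0 below a, rising linearly to 1 at b."""
    x = np.asarray(x, dtype=float)
    return np.clip((x - a) / (b - a), 0.0, 1.0)

# Made-up membership functions (illustrative ranges only).
dist_small = lambda d: left_shoulder(d, 10.0, 50.0)     # distance [m]
dist_large = lambda d: right_shoulder(d, 20.0, 100.0)
speed_low  = lambda v: left_shoulder(v, 20.0, 60.0)     # speed [km/h]
speed_high = lambda v: right_shoulder(v, 30.0, 100.0)

y_axis = np.linspace(0.0, 1.0, 101)                     # brake strength
brake_weak   = left_shoulder(y_axis, 0.2, 0.6)
brake_strong = right_shoulder(y_axis, 0.4, 0.8)

def infer(distance, speed):
    """Mamdani-style inference: min for AND, clip each rule's output
    set at the rule's firing strength, aggregate with max, and
    defuzzify by the center of gravity."""
    # R1: IF distance is small AND speed is high THEN brake is strong.
    w1 = min(dist_small(distance), speed_high(speed))
    # R2: IF distance is large AND speed is low THEN brake is weak.
    w2 = min(dist_large(distance), speed_low(speed))
    aggregated = np.maximum(np.minimum(w1, brake_strong),
                            np.minimum(w2, brake_weak))
    return float((aggregated * y_axis).sum() / aggregated.sum())

print("brake strength:", infer(distance=30.0, speed=50.0))
\end{verbatim}
\noindent The min/max choice corresponds to the ``Standard'' row of the table above; swapping in the algebraic product and sum would give another common variant.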
\subsection{Fuzzy Classification}
\subsubsection{Fixed membership}
\subsubsection{Takagi-Sugeno Model}
\clearpage
\section*{APPENDIX}
\subsection*{I. Bayes formula in another scenario}
\begin{itemize}
\item[$\cdot$] Three prisoners ({\bf A}, {\bf B}, and {\bf C}) are in a prison.
\item[$\cdot$] {\bf A} knows that two out of the three are to be executed tomorrow, and the remaining one will be set free.
\item[$\cdot$] {\bf A} thought: at least one of {\bf B} and {\bf C} is sure to be executed.
\item[$\cdot$] Then {\bf A} asked a guard: ``even if you tell me which of {\bf B} and {\bf C} will be executed, that will give me no information about my own fate, so please tell me.''
\item[$\cdot$] The guard answers that {\bf C} will. $\Rightarrow$ data $D$
\item[$\cdot$] Now {\bf A} knows that one of {\bf A} or {\bf B} is sure to be free.
\end{itemize}
\begin{center} \underline{\hspace*{6in}} \end{center}
\noindent Do you guess that the probability $p(A \vert D) = 1/2$?\\
\hspace*{1in}\\
If this were correct, then the guard's answer would indeed have given {\bf A} some information about his own fate, since the prior probability was $p(A)=1/3$.\\
\noindent You agree that
\begin{displaymath} p(A) = p(B) = p(C) = 1/3. \end{displaymath}
Now try to apply the Bayesian rule, i.e., obtain the conditional probability of the data ``C will be executed'' under the condition that ``A will be free tomorrow'', and in the same way for B and C. They are:
\begin{eqnarray} \nonumber p(D \vert A) &=& 1/2. \\ \nonumber p(D \vert B) &=& 1. \\ \nonumber p(D \vert C) &=& 0. \end{eqnarray}
In conclusion:
\begin{displaymath} p(A \vert D)=\frac{p(D \vert A)p(A)} {p(D \vert A)p(A)+p(D \vert B)p(B)+p(D \vert C)p(C)} =1/3. \end{displaymath}
This shows that the probability did not change after the information was given!
\clearpage
\begin{itemize}
\item Now it's clear that
\begin{itemize}
\item $p(A)$ and so on are to be called
\begin{itemize} \item[] $\star$ {\it a priori} probabilities; \end{itemize}
\item and $p(A \vert D)$ and so on
\begin{itemize} \item[] $\star$ {\it a posteriori} probabilities.\\ \end{itemize}
\end{itemize}
\item In the same way,
\begin{itemize}
\item $p(\omega_i)$ is called
\begin{itemize} \item[] $\star$ the {\it a priori} probability\footnote{Usually given, but if unknown it can be estimated as $N_i/N$, where $N_i$ is the number of training samples which belong to class $\omega_i$ and $N$ is the total number of training samples.}; \end{itemize}
\item $p(\omega_i \vert \mbox{\bf x})$
\begin{itemize} \item[] $\star$ the {\it a posteriori} probability.\\ \end{itemize}
\end{itemize}
\item[] Furthermore,
\begin{itemize}
\item $p(\mbox{\bf x})$ is called
\begin{itemize} \item[] $\star$ the p.d.f. of {\bf x}; \end{itemize}
\item $p(\mbox{\bf x} \vert \omega_i)$
\begin{itemize} \item[] $\star$ the class-conditional p.d.f.\footnote{This is also estimated from training data, as will be explained later in more detail.}
\begin{itemize} \item which describes the distribution of the feature vectors {\bf x} in each of the classes $\omega_i$. \end{itemize}
\end{itemize}
\end{itemize}
\end{itemize}
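\noindent A tiny numerical check of the three-prisoners calculation above (a sketch, assuming Python is available):
\begin{verbatim}
# Priors of being set free, and likelihoods of the guard's answer
# D = "C will be executed" under each hypothesis.
prior = {"A": 1/3, "B": 1/3, "C": 1/3}
likelihood = {"A": 1/2, "B": 1.0, "C": 0.0}

evidence = sum(likelihood[h] * prior[h] for h in prior)
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}
print(posterior)   # {'A': 0.333..., 'B': 0.666..., 'C': 0.0}
\end{verbatim}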
\clearpage
\subsection*{II. Quadratic form in 2-dimensional space}
You might be interested, first of all, in how the scatter of the points is influenced by the values in $\Sigma$, that is, $\sigma_1^2$, $\sigma_2^2$, and $\sigma_{12}=\sigma_{21}$. Let's observe three different cases of $\Sigma$ with $\mu=(0, 0)$.
\begin{small}
\begin{eqnarray} \nonumber (1) \hspace{.1in}\Sigma_1 = \left( \begin{array}{cc} 0.20 & 0 \\ 0 & 0.20 \end{array} \right) \hspace{.1in} (2) \hspace{.1in}\Sigma_2 = \left( \begin{array}{cc} 0.1 & 0 \\ 0 & 0.9 \end{array} \right) \hspace{.1in} (3) \hspace{.1in}\Sigma_3 = \left( \begin{array}{cc} 0.5 & 0.3 \\ 0.3 & 0.2 \end{array} \right) \end{eqnarray}
\end{small}
\begin{figure}[bhtp] \begin{center}
\includegraphics[width=5cm,height=5cm]{gaussian-2d-200-points-1.eps}
\includegraphics[width=5cm,height=5cm]{gaussian-2d-200-points-2.eps}
\includegraphics[width=5cm,height=5cm]{gaussian-2d-200-points-3.eps}
\end{center}
\caption{Clouds of 200 points, each drawn from $N[\mu, \Sigma]$ with one of the three different $\Sigma$'s above.}
\end{figure}
\clearpage
\subsection*{III. How to calculate the inverse of a 3-dimensional matrix}
\label{append}
We now calculate the inverse of the following $3 \times 3$ matrix $A$, which appeared in subsection \ref{subsec}:
\begin{eqnarray} \nonumber A = \left( \begin{array}{ccc} 0.3 & 0.1 & 0.1 \\ 0.1 & 0.3 & -0.1 \\ 0.1 & -0.1 & 0.3 \end{array} \right) \end{eqnarray}
\noindent We solve $A\mbox{\bf x} = \mbox{\bf b}$ for the three unit vectors $\mbox{\bf b}$ simultaneously, writing the right-hand sides as the columns of the identity matrix $I$, i.e.,
\begin{eqnarray} \nonumber \left( \begin{array}{ccc} 0.3 & 0.1 & 0.1 \\ 0.1 & 0.3 & -0.1 \\ 0.1 & -0.1 & 0.3 \end{array} \right) \left( \begin{array}{r} x \\ y \\ z \end{array} \right) = \left( \begin{array}{ccc} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{array} \right) \end{eqnarray}
The solution set remains identical if we multiply the {\it 2nd row} by 3 and subtract the {\it 1st row}, i.e.,
\begin{eqnarray} \nonumber \left( \begin{array}{ccc} 0.3 & 0.1 & 0.1 \\ 0 & 0.8 & -0.4 \\ 0.1 & -0.1 & 0.3 \end{array} \right) \left( \begin{array}{r} x \\ y \\ z \end{array} \right) = \left( \begin{array}{ccc} 1 & 0 & 0 \\ -1 & 3 & 0 \\ 0 & 0 & 1 \end{array} \right) \end{eqnarray}
In the same way, but this time we multiply the {\it 3rd row} by 3 and subtract the {\it 1st row}.
\begin{eqnarray} \nonumber \left( \begin{array}{ccc} 0.3 & 0.1 & 0.1 \\ 0 & 0.8 & -0.4 \\ 0 & -0.4 & 0.8 \end{array} \right) \left( \begin{array}{r} x \\ y \\ z \end{array} \right) = \left( \begin{array}{ccc} 1 & 0 & 0 \\ -1 & 3 & 0 \\ -1 & 0 & 3 \end{array} \right) \end{eqnarray}
Then multiply the {\it 1st row} by 8 and subtract the {\it 2nd row}:
\begin{eqnarray} \nonumber \left( \begin{array}{ccc} 2.4 & 0 & 1.2 \\ 0 & 0.8 & -0.4 \\ 0 & -0.4 & 0.8 \end{array} \right) \left( \begin{array}{r} x \\ y \\ z \end{array} \right) = \left( \begin{array}{ccc} 9 & -3 & 0 \\ -1 & 3 & 0 \\ -1 & 0 & 3 \end{array} \right) \end{eqnarray}
Multiply the {\it 3rd row} by 2 and add the {\it 2nd row}:
\begin{eqnarray} \nonumber \left( \begin{array}{ccc} 2.4 & 0 & 1.2 \\ 0 & 0.8 & -0.4 \\ 0 & 0 & 1.2 \end{array} \right) \left( \begin{array}{r} x \\ y \\ z \end{array} \right) = \left( \begin{array}{ccc} 9 & -3 & 0 \\ -1 & 3 & 0 \\ -3 & 3 & 6 \end{array} \right) \end{eqnarray}
Subtract the {\it 3rd row} from the {\it 1st row}:
\begin{eqnarray} \nonumber \left( \begin{array}{ccc} 2.4 & 0 & 0 \\ 0 & 0.8 & -0.4 \\ 0 & 0 & 1.2 \end{array} \right) \left( \begin{array}{r} x \\ y \\ z \end{array} \right) = \left( \begin{array}{ccc} 12 & -6 & -6 \\ -1 & 3 & 0 \\ -3 & 3 & 6 \end{array} \right) \end{eqnarray}
Multiply the {\it 2nd row} by 3 and add the {\it 3rd row}:
\begin{eqnarray} \nonumber \left( \begin{array}{ccc} 2.4 & 0 & 0 \\ 0 & 2.4 & 0 \\ 0 & 0 & 1.2 \end{array} \right) \left( \begin{array}{r} x \\ y \\ z \end{array} \right) = \left( \begin{array}{ccc} 12 & -6 & -6 \\ -6 & 12 & 6 \\ -3 & 3 & 6 \end{array} \right) \end{eqnarray}
Finally, divide the {\it 1st row} by 2.4, the {\it 2nd row} by 2.4, and the {\it 3rd row} by 1.2; we obtain
\begin{eqnarray} \nonumber \left( \begin{array}{ccc} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{array} \right) \left( \begin{array}{r} x \\ y \\ z \end{array} \right) = \frac{1}{2} \left( \begin{array}{ccc} 10 & -5 & -5 \\ -5 & 10 & 5 \\ -5 & 5 & 10 \end{array} \right) \end{eqnarray}
In other words, we have row-reduced the augmented pair $(A \mid I)$ to $(I \mid B)$: the columns of the right-hand-side matrix $B$ are the solutions of $A\mbox{\bf x} = \mbox{\bf b}$ for the three unit vectors $\mbox{\bf b}$, hence $B = A^{-1}$.\\
\noindent To make sure, calculate and find
\begin{eqnarray} \nonumber \left( \begin{array}{ccc} 0.3 & 0.1 & 0.1 \\ 0.1 & 0.3 & -0.1 \\ 0.1 & -0.1 & 0.3 \end{array} \right) \times \frac{1}{2} \left( \begin{array}{ccc} 10 & -5 & -5 \\ -5 & 10 & 5 \\ -5 & 5 & 10 \end{array} \right) = \left( \begin{array}{ccc} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{array} \right) \end{eqnarray}
\noindent Therefore
\begin{eqnarray} \nonumber A^{-1} = \frac{1}{2} \left( \begin{array}{ccc} 10 & -5 & -5 \\ -5 & 10 & 5 \\ -5 & 5 & 10 \end{array} \right) \end{eqnarray}
\end{document}