\documentclass[12pt]{article}
\usepackage{graphicx} \usepackage{amsmath} \usepackage{amssymb} \usepackage{rsfs}
\textheight 230mm \topmargin 5mm \textwidth 160mm \oddsidemargin -2mm
\pagestyle{myheadings} \markright{(Pattern Recognition)}
\newtheorem{formula}{Rule} \newtheorem{example}{Example} \newtheorem{excersize}{Exercise} \newtheorem{algorithm}{Algorithm}
\begin{document} \thispagestyle{empty}
\title{{\bf Pattern Classification} \\ {\large for \\ postgraduates \\ (9 weeks from 21 February to 17 April, 2007)}}
\author{Akira Imada \\ Brest State Technical University, Belarus \\ \hspace*{1in}\\ \hspace*{1in}\\ (last modified on)}
\maketitle
\clearpage
\begin{center}{\bf INDEX} \end{center}
\begin{itemize}
\item Introduction
  \begin{itemize}
  \item What is Classification?
    \begin{itemize}
    \item Let's start with an example --- female salmon and male salmon.
    \end{itemize}
  \end{itemize}
\item Bayesian Classification -- I
  \begin{itemize}
  \item What if we know only the prior probability of each class?
  \item How does the posterior probability improve the likelihood of the classes?
  \item Bayesian decision theory.
  \item Classification by the {\it likelihood ratio}.
  \end{itemize}
\item Bayesian Classification -- II
  \begin{itemize}
  \item What if the class-conditional distribution of the observations is a Gaussian pdf?
    \begin{itemize}
    \item One-dimensional case.
    \item Two-dimensional case.
    \item Higher-dimensional case.
    \end{itemize}
  \end{itemize}
\item Bayesian Classification -- III
  \begin{itemize}
  \item What if we know the class-conditional distribution $p(\mbox{\bf x} \vert \omega)$ but don't know some of its parameters $\boldsymbol{\theta}$?
    \begin{itemize}
    \item Maximum likelihood parameter estimation.
    \item Bayesian parameter estimation.
    \end{itemize}
  \end{itemize}
\item Bayesian Classification -- IV
  \begin{itemize}
  \item What if we don't know the class-conditional distribution $p(\mbox{\bf x} \vert \omega)$ at all?
    \begin{itemize}
    \item Parzen window approach.
    \item Probabilistic Neural Network implementation of the Parzen window approach.
    \item $k_n$-nearest neighbor approach.
    \end{itemize}
  \end{itemize}
\item Bayesian Belief Network.
\item Hidden Markov Model.
\item Fuzzy Logic Classification.
\end{itemize}
\clearpage
\section{An example to start with.}
In general, we first need to know what {\it features} can be used for the purpose of pattern classification. This is, of course, one of the big issues in pattern classification, but here we assume that we already know {\it features} which can be used efficiently to classify our target patterns.\\
\subsection{Which do you like better, female or male?}
I mean not humans, but salmon. Assume, as an example, that we want to classify salmon into females and males. The situation is that we have thousands of salmon, and our task is to classify a salmon not yet seen into female or male only by observing one feature $x$ --- the length of the salmon.
\subsection{If we only know the prior probability.}
Suppose we have no prior information except how many of the salmon seen so far were female and how many were male. We denote the class of female salmon as $\omega_1$ and the class of male salmon as $\omega_2$.
Then the above-mentioned ``how many of those seen so far were female and male'' are denoted $P(\omega_1)$ and $P(\omega_2)$, respectively, and are called {\it prior probabilities}.\\
\noindent In this case, all we can do is act according to the following rule. \bigskip
\begin{formula}[Classification only with prior probability] \label{rule-1}
If $P(\omega_1) > P(\omega_2)$ then classify it to $\omega_1$, otherwise to $\omega_2$.
\end{formula}
\noindent This might seem a trivial rule.
\subsection{If we also know the class-conditional distribution of the feature.}
If we further assume that we know the distribution of the length $x$ in each of the female and male salmon, that is, $p(x \vert \omega_1)$ and $p(x \vert \omega_2)$, then our task of classifying a salmon from an observation of $x$ becomes a little more sophisticated.\\
\noindent Note that here $P(\cdot)$ denotes a probability and $p(\cdot)$ denotes a probability density.\\
\noindent Intuitively, we might put a threshold $\theta$ as in Fig.~1. If both classes are equally important, we put it at the center between the two distributions. But sometimes we might place it differently, as in Fig.~2, if we take into account that the risk of misclassifying $x$ into $\omega_2$ while $x$ actually belongs to $\omega_1$ matters more than the opposite error\footnote{In the case of salmon, we tend to be happier if the one we bought cheaply as a male turns out to be, by mistake, a female with eggs.}.\\
\begin{figure}[bhtp] \begin{center}
\includegraphics[width=15cm,height=5cm]{two-gaussian-2.eps}
\end{center}
\caption{An example of the two distributions of the observation $x$ w.r.t. each of the classes $\omega_1$ and $\omega_2$. Here the reader might imagine the distributions of the lengths of female and male salmon.}
\label{fig:well-separate-gaussian}
\end{figure}
\begin{figure}[bhtp] \begin{center}
\includegraphics[width=15cm,height=5cm]{two-gaussian.eps}
\end{center}
\caption{Yet another example, where the two distributions are closer than in the first example. Two thresholds are shown, taking the risk of misclassification into account.}
\end{figure}
\noindent In the example of Fig.~\ref{fig:well-separate-gaussian}, you might imagine {\it a classification of humans into female and male by one feature --- the frequency of her/his voice}. In this case $p(x \vert \omega_1)$ and $p(x \vert \omega_2)$ would be well separated and easy to classify.
\section{What is Bayesian Classification?}
Assume we should guess the class that will come next with the knowledge of
\begin{itemize}
\item[(1)] the prior probabilities of class 1 ($\omega_1$) and class 2 ($\omega_2$), that is, $P(\omega_1)$ and $P(\omega_2)$;
\item[(2)] the class-conditional probability density of the observation $x$ for each of the two classes, that is, $p(x \vert \omega_1)$ and $p(x \vert \omega_2)$.
\end{itemize}
\noindent Then we can calculate $P(\omega_i \vert x)$, the probability that the class is $\omega_i$ given that $x$ is observed, as:
\begin{equation} \label{eq:001} P(\omega_i \vert x) = \frac{p(x \vert \omega_i)P(\omega_i)}{p(x)} \hspace{.1in} (i=1, 2), \end{equation}
\noindent where
\begin{equation} \label{eq:002} p(x) = p(x \vert \omega_1)P(\omega_1) + p(x \vert \omega_2)P(\omega_2) \end{equation}
which is sometimes called the {\it evidence}. Then our rule is:\\ \bigskip
\begin{formula}[Classification with posterior probability]\label{rule-2}
If $P(\omega_1 \vert x) > P(\omega_2 \vert x)$ then classify it to $\omega_1$, otherwise to $\omega_2$.
\end{formula}
\subsection{Bayesian Decision Theory}
Assume now we have multiple {\it features}:
\begin{displaymath} x_1, x_2, \cdots, x_d \end{displaymath}
multiple {\it classes}:
\begin{displaymath} \omega_1, \omega_2, \cdots, \omega_c \end{displaymath}
and also {\it actions}:
\begin{displaymath} \alpha_1, \alpha_2, \cdots, \alpha_a \end{displaymath}
\noindent to take, instead of merely guessing which class. A loss function $\lambda(\alpha_i \vert \omega_j)$ is defined as the loss incurred when we take action $\alpha_i$ while the true situation is $\omega_j$.\\
\noindent If we take action $\alpha_i$ when we observe $\mbox{\bf x}$, the conditional {\it risk} is defined as
\begin{equation} R(\alpha_i \vert \mbox{\bf x}) = \sum_{j=1}^{c}{\lambda(\alpha_i \vert \omega_j) P(\omega_j \vert \mbox{\bf x})}. \end{equation}
We want to take the action that minimizes this risk. Recall here
\begin{equation} P(\omega_j \vert \mbox{\bf x}) = \frac{p(\mbox{\bf x} \vert \omega_j)P(\omega_j)} {p(\mbox{\bf x})} \end{equation}
where
\begin{equation} p(\mbox{\bf x}) = \sum_{j=1}^{c}{p(\mbox{\bf x} \vert \omega_j)P(\omega_j)}. \end{equation}
\subsection{Two-category classification}
Let us simplify the notation as $\lambda_{ij} = \lambda(\alpha_i \vert \omega_j)$, that is,
\begin{eqnarray} \Lambda = \left( \begin{array}{cc} \lambda_{11} & \lambda_{12} \\ \lambda_{21} & \lambda_{22} \end{array} \right). \end{eqnarray}
Then
\begin{equation} \nonumber R(\alpha_1 \vert \mbox{\bf x}) = \lambda_{11}P(\omega_1 \vert \mbox{\bf x}) + \lambda_{12}P(\omega_2 \vert \mbox{\bf x}) \end{equation}
\begin{equation} R(\alpha_2 \vert \mbox{\bf x}) = \lambda_{21}P(\omega_1 \vert \mbox{\bf x}) + \lambda_{22}P(\omega_2 \vert \mbox{\bf x}) \end{equation}
Here we have:
\begin{formula}[Rule for classification]
If $R(\alpha_1 \vert \mbox{\bf x}) < R(\alpha_2 \vert \mbox{\bf x})$ then classify $\mbox{\bf x}$ to $\omega_1$, otherwise to $\omega_2$.
\end{formula}
\subsubsection{Likelihood ratio}
If we further assume that we have one feature $x$ instead of the multiple features $\mbox{\bf x}$, the {\it if}-part of the rule above becomes
\begin{equation} (\lambda_{21} - \lambda_{11})P(\omega_1 \vert x) > (\lambda_{12} - \lambda_{22})P(\omega_2 \vert x). \end{equation}
Since an error usually costs more than a correct decision, we can assume $\lambda_{21}>\lambda_{11}$ and $\lambda_{12}>\lambda_{22}$, and the rule becomes:
\begin{formula}[Rule for classification] If the following holds
\begin{equation} \frac{p(x \vert \omega_1)}{p(x \vert \omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}}\frac{P(\omega_2)}{P(\omega_1)} \end{equation}
then classify $x$ to $\omega_1$, otherwise to $\omega_2$.
\end{formula}
Note that the left-hand side is a function of $x$ called the {\it likelihood ratio}, while the right-hand side is a constant independent of $x$ which acts as a {\it threshold} to discriminate the two classes.
\begin{figure}[bhtp] \begin{center}
\includegraphics[width=14cm,height=7cm]{liklihood-ratio.eps}
\end{center} \end{figure}
\subsection{Discriminant function}
Let us define a function, not a probability nor a density but just a function, whose purpose is to partition the feature space into categories.
\section{Examples of Decision Surfaces}
\subsection{1-D Gaussian case.}
We now assume
\begin{displaymath} p(x \vert \omega_1) = \frac{1}{\sqrt{2\pi} \sigma_1} \exp\{{-\frac{1}{2} \frac{(x-\mu_1)^2}{{\sigma_1}^2}}\} \end{displaymath}
and
\begin{displaymath} p(x \vert \omega_2) = \frac{1}{\sqrt{2\pi} \sigma_2} \exp\{{-\frac{1}{2} \frac{(x-\mu_2)^2}{{\sigma_2}^2}}\}. \end{displaymath}
The one-dimensional version of our Rule~\ref{rule-2} is,
\begin{center}{\it If $P(\omega_1 \vert x) > P(\omega_2 \vert x)$ then classify $x$ to $\omega_1$, otherwise to $\omega_2$. } \end{center}
So, recalling
\begin{displaymath} P(\omega_i \vert x) = \frac{p(x \vert \omega_i) P(\omega_i)}{p(x)} \end{displaymath}
and that $p(x)$ is common to both $i=1$ and $i=2$, the {\it if}-part of the above rule is
\begin{displaymath} \frac{1}{\sqrt{2\pi} \sigma_1} \exp\{{-\frac{1}{2} \frac{(x-\mu_1)^2}{{\sigma_1}^2}}\} P(\omega_1) > \frac{1}{\sqrt{2\pi} \sigma_2} \exp\{{-\frac{1}{2} \frac{(x-\mu_2)^2}{{\sigma_2}^2}}\} P(\omega_2) \end{displaymath}
Since all the terms in this inequality are positive, the comparison is preserved if we take the natural logarithm of both sides.\footnote{That is, \begin{displaymath} A > B \Leftrightarrow \ln(A) > \ln(B). \end{displaymath}} Therefore
\begin{displaymath} \ln\frac{1}{\sqrt{2\pi}} + \ln\frac{1}{\sigma_1} - \frac{1}{2}\frac{(x-\mu_1)^2}{{\sigma_1}^2} + \ln P(\omega_1) > \ln\frac{1}{\sqrt{2\pi}} + \ln\frac{1}{\sigma_2} - \frac{1}{2}\frac{(x-\mu_2)^2}{{\sigma_2}^2} + \ln P(\omega_2) \end{displaymath}
\bigskip
\begin{excersize} Obtain the decision boundary $x_0$ when the two classes follow the Gaussian distributions $N(1, \frac{1}{2})$ and $N(3, \frac{1}{2})$, respectively. \end{excersize}
\subsection{2-D Gaussian case}
Recall the Bayesian formula Eq. (\ref{eq:001})
\begin{displaymath} P(\omega_i \vert \mbox{\bf x}) = \frac{p(\mbox{\bf x} \vert \omega_i)P(\omega_i)}{p(\mbox{\bf x})} \hspace{.1in} (i=1, 2), \end{displaymath}
\noindent where $ p(\mbox{\bf x}) = p(\mbox{\bf x} \vert \omega_1)P(\omega_1) + p(\mbox{\bf x} \vert \omega_2)P(\omega_2) $ is just the term that normalizes the posteriors $P(\omega_i \vert \mbox{\bf x})$ so that they sum to 1. Further, we assume $ P(\omega_1) = P(\omega_2) = 1/2 $ just for the sake of simplicity here.
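\noindent With equal priors, comparing the posteriors reduces to comparing the class-conditional densities. Before working out the Gaussian cases analytically, here is a minimal numerical sketch of Rule~\ref{rule-2} applied to the one-dimensional exercise above (a sketch only; it assumes Python with NumPy is available and reads $N(1,\frac{1}{2})$, $N(3,\frac{1}{2})$ as mean and variance):
\begin{verbatim}
import numpy as np

def gaussian_pdf(x, mu, var):
    """1-D Gaussian density N(mu, var)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def classify(x, prior1=0.5, prior2=0.5):
    """Rule 2: pick the class with the larger posterior.  With equal
    priors this is just a comparison of class-conditional densities."""
    p1 = gaussian_pdf(x, mu=1.0, var=0.5) * prior1
    p2 = gaussian_pdf(x, mu=3.0, var=0.5) * prior2
    return 1 if p1 > p2 else 2

# Scan a grid to locate the decision boundary numerically.
xs = np.linspace(0.0, 4.0, 401)
labels = np.array([classify(x) for x in xs])
boundary = xs[np.argmax(labels == 2)]
print("estimated decision boundary x0 ~", boundary)
\end{verbatim}
\noindent With equal variances and equal priors, the boundary found this way sits midway between the two means.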
So our decision rule to determine which class, $\omega_1$ or $\omega_2$, should be assigned to {\bf x}:
\begin{center} {\it ``If $P(\omega_1 \vert \mbox{\bf x}) > P(\omega_2 \vert \mbox{\bf x})$ then {\bf x} is $\omega_1$, otherwise $\omega_2$'' } \end{center}
depends only on $p(\mbox{\bf x} \vert \omega_i)$, which we assume to be a Gaussian p.d.f. That is,
\begin{equation} \label{eq: gaussian-multi-d} p(\mbox{\bf x} \vert \omega_i) = \frac{1}{(2\pi)^{d/2} \vert \Sigma_i \vert ^{1/2}} \exp\{- \frac{1}{2}(\mbox{\bf x} - \boldsymbol{\mu}_i)^t {\Sigma_i}^{-1} (\mbox{\bf x} - \boldsymbol{\mu}_i)\} \end{equation}
\subsubsection{What will the border look like under which conditions?}
Now that we restrict our universe to two-dimensional space, we use the notation $(x,y)$ instead of $(x_1,x_2)$, so we write $\mbox{\bf x} = (x, y)$. Furthermore, both of our two classes are assumed to follow Gaussian p.d.f.'s whose means are $\boldsymbol{\mu}_1 = (0,0)$ and $\boldsymbol{\mu}_2 = (1,0)$, and whose covariance matrices are
\begin{eqnarray} \nonumber \Sigma_1 = \left( \begin{array}{cc} a_1 & 0 \\ 0 & b_1 \end{array} \right), \hspace*{.2in}\Sigma_2 = \left( \begin{array}{cc} a_2 & 0 \\ 0 & b_2 \end{array} \right) \end{eqnarray}
\noindent Under this simple condition the determinants are $\vert \Sigma_1 \vert = a_1b_1$ and $\vert \Sigma_2 \vert = a_2b_2$, and the inverse matrices are simply
\begin{eqnarray} \nonumber \Sigma_1^{-1} = \frac{1}{a_1b_1} \left( \begin{array}{cc} b_1 & 0 \\ 0 & a_1 \end{array} \right) = \left( \begin{array}{cc} 1/a_1 & 0 \\ 0 & 1/b_1 \end{array} \right) \end{eqnarray}
and in the same way
\begin{eqnarray} \nonumber \Sigma_2^{-1} = \frac{1}{a_2b_2} \left( \begin{array}{cc} b_2 & 0 \\ 0 & a_2 \end{array} \right) = \left( \begin{array}{cc} 1/a_2 & 0 \\ 0 & 1/b_2 \end{array} \right) \end{eqnarray}
Now Eq. (\ref{eq: gaussian-multi-d}) reads, more specifically,
\begin{eqnarray} \nonumber p(\mbox{\bf x} \vert \omega_1) = \frac{1}{2\pi \sqrt{a_1b_1}} \exp\{- \frac{1}{2} (x \hspace{.1in} y) \left( \begin{array}{cc} 1/a_1 & 0 \\ 0 & 1/b_1 \end{array} \right) \left( \begin{array}{c} x \\ y \end{array} \right)\} \end{eqnarray}
and
\begin{eqnarray} \nonumber p(\mbox{\bf x} \vert \omega_2) = \frac{1}{2\pi \sqrt{a_2b_2}} \exp\{- \frac{1}{2} (x-1 \hspace{.1in} y) \left( \begin{array}{cc} 1/a_2 & 0 \\ 0 & 1/b_2 \end{array} \right) \left( \begin{array}{c} x-1 \\ y \end{array} \right)\} \end{eqnarray}
Then, taking the natural logarithm, we can define our discriminant functions $g_i(\mbox{\bf x})$, $i = 1, 2$, as
\begin{eqnarray} \nonumber g_1(\mbox{\bf x}) = - \frac{1}{2} (x \hspace{.1in} y) \left( \begin{array}{cc} 1/a_1 & 0 \\ 0 & 1/b_1 \end{array} \right) \left( \begin{array}{c} x \\ y \end{array} \right) - \ln (2\pi) - \frac{1}{2} \ln(a_1b_1) \end{eqnarray}
and
\begin{eqnarray} \nonumber g_2(\mbox{\bf x}) = - \frac{1}{2} (x-1 \hspace{.1in} y) \left( \begin{array}{cc} 1/a_2 & 0 \\ 0 & 1/b_2 \end{array} \right) \left( \begin{array}{c} x-1 \\ y \end{array} \right) - \ln (2\pi) - \frac{1}{2} \ln(a_2b_2) \end{eqnarray}
Neglecting the term $-\ln(2\pi)$ common to both equations, our new discriminant functions are
\begin{displaymath} g_1(\mbox{\bf x}) = -\frac{1}{2}\{\frac{x^2}{a_1}+\frac{y^2}{b_1}\} - \frac{1}{2} \ln(a_1b_1) \end{displaymath}
and
\begin{displaymath} g_2(\mbox{\bf x}) = -\frac{1}{2}\{\frac{(x-1)^2}{a_2}+\frac{y^2}{b_2}\} - \frac{1}{2} \ln(a_2b_2) \end{displaymath}
Finally, we obtain the border equation from $g_1(\mbox{\bf x}) - g_2(\mbox{\bf x}) = 0$:
\begin{equation}\label{eq:two-d-gaussian-final} (\frac{1}{a_1}-\frac{1}{a_2})x^2 + \frac{2}{a_2}x + (\frac{1}{b_1}-\frac{1}{b_2})y^2 = \frac{1}{a_2} + \ln \frac{a_2b_2}{a_1b_1} \end{equation}
\noindent We now see that the shape of the border will be one of the following six cases: (i) a straight line; (ii) a circle; (iii) an ellipse; (iv) a parabola; (v) a hyperbola; (vi) two straight lines, depending on how the points are distributed, that is, depending on $a_1$, $b_1$, $a_2$ and $b_2$ in the situation above.
\subsubsection{Examples}
Let's try some calculations under the above assumptions, that is,
\begin{eqnarray} \nonumber (1) \hspace{.2in}\Sigma_1 = \left( \begin{array}{cc} 0.10 & 0 \\ 0 & 0.10 \end{array} \right), \hspace{.5in}\Sigma_2 = \left( \begin{array}{cc} 0.10 & 0 \\ 0 & 0.10 \end{array} \right) \end{eqnarray}
\begin{eqnarray} \nonumber (2) \hspace{.2in}\Sigma_1 = \left( \begin{array}{cc} 0.10 & 0 \\ 0 & 0.10 \end{array} \right), \hspace{.5in}\Sigma_2 = \left( \begin{array}{cc} 0.20 & 0 \\ 0 & 0.20 \end{array} \right) \end{eqnarray}
\begin{eqnarray} \nonumber (3) \hspace{.2in}\Sigma_1 = \left( \begin{array}{cc} 0.10 & 0 \\ 0 & 0.15 \end{array} \right), \hspace{.5in}\Sigma_2 = \left( \begin{array}{cc} 0.20 & 0 \\ 0 & 0.25 \end{array} \right) \end{eqnarray}
\begin{eqnarray} \nonumber (4) \hspace{.2in}\Sigma_1 = \left( \begin{array}{cc} 0.10 & 0 \\ 0 & 0.15 \end{array} \right), \hspace{.5in}\Sigma_2 = \left( \begin{array}{cc} 0.15 & 0 \\ 0 & 0.10 \end{array} \right) \end{eqnarray}
\begin{eqnarray} \nonumber (5) \hspace{.2in}\Sigma_1 = \left( \begin{array}{cc} 0.10 & 0 \\ 0 & 0.20 \end{array} \right), \hspace{.5in}\Sigma_2 = \left( \begin{array}{cc} 0.10 & 0 \\ 0 & 0.10 \end{array} \right) \end{eqnarray}
\noindent The example below is somewhat tricky. I wanted an example in which the right-hand side of equation (\ref{eq:two-d-gaussian-final}) becomes zero and the left-hand side is a product of two first-order expressions in $x$ and $y$. As you might know, this is the case where the border is made up of two straight lines.
\begin{eqnarray} \nonumber (6) \hspace{.2in}\Sigma_1 = \left( \begin{array}{cc} 2e & 0 \\ 0 & 0.5 \end{array} \right), \hspace{.5in}\Sigma_2 = \left( \begin{array}{cc} 1 & 0 \\ 0 & 1 \end{array} \right) \end{eqnarray}
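\noindent Before listing the resulting border equations, here is a small numerical sanity check (a sketch assuming Python with NumPy is available; it is not part of the derivation) that evaluates $g_1 - g_2$ for setting (2) and locates where the border crosses the $x$-axis:
\begin{verbatim}
import numpy as np

def g(x, y, mux, a, b):
    """Discriminant for a diagonal covariance diag(a, b), mean (mux, 0)."""
    return -0.5 * ((x - mux) ** 2 / a + y ** 2 / b) - 0.5 * np.log(a * b)

# Setting (2): Sigma_1 = 0.1 I around (0,0), Sigma_2 = 0.2 I around (1,0).
a1 = b1 = 0.10
a2 = b2 = 0.20

xs = np.linspace(-3.0, 3.0, 601)
diff = g(xs, 0.0, 0.0, a1, b1) - g(xs, 0.0, 1.0, a2, b2)

# Sign changes of g1 - g2 along y = 0 approximate the border crossings.
crossings = xs[np.where(np.diff(np.sign(diff)) != 0)]
print("border crosses the x-axis near:", crossings)
\end{verbatim}
\noindent The result can be compared with the closed-form border listed below.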
My quick calculations tentatively give the following; see also the figure below.
\begin{itemize} \setlength{\itemsep}{0pt}
\item[(1)] $2x = 1 $
\item[(2)] $5 (x + 1)^2 + 5 y^2 = 10 + \ln4 $
\item[(3)] $5(x + 1)^2 + (8/3) y^2 = 10 + \ln (10/3)$
\item[(4)] $5(x + 1)^2 - (10/3) y^2 = 10 $
\item[(5)] $20x - 5 y^2 = 10 - \ln 2 $
\item[(6)] $(1 - 1/2e) x^2 - x - y^2 = 0 $
\end{itemize}
\begin{figure}[bhtp]
\begin{center}
\includegraphics[width=5cm,height=5cm]{two-d-example-1.eps}
\includegraphics[width=5cm,height=5cm]{two-d-example-2.eps}
\includegraphics[width=5cm,height=5cm]{two-d-example-3.eps}
\includegraphics[width=5cm,height=5cm]{two-d-example-6.eps}
\includegraphics[width=5cm,height=5cm]{two-d-example-5.eps}
\includegraphics[width=5cm,height=5cm]{two-d-example-4.eps}
\end{center}
\caption{A cloud of 100 points drawn from each of the two classes, and the border between the two classes calculated under the six different conditions. (The results for (4), (5) and (6) are still fishy and under another trial.)}
\label{fig:2d-ex}
\end{figure}
\noindent The final example in this subsection is a more general 2-dimensional case, (artificially) devised so that the calculations do not become too complicated. We now assume $\boldsymbol{\mu}_1 = (0,0)$ and $\boldsymbol{\mu}_2 = (1,1)$, and the two classes have the covariance matrices
\begin{eqnarray} \nonumber (7) \hspace{.2in}\Sigma_1 = \left( \begin{array}{cc} 2.0 & 0.5 \\ 0.5 & 2.0 \end{array} \right), \hspace{.5in}\Sigma_2 = \left( \begin{array}{cc} 5.0 & 4.0 \\ 4.0 & 5.0 \end{array} \right) \end{eqnarray}
\clearpage
\subsection{3-D Gaussian case}\label{subsec}
Here we study only one example. We assume two classes with $P(\omega_1) = P(\omega_2) = 1/2$. In each class the patterns are distributed according to a Gaussian p.d.f., and both classes have the same covariance matrix
\begin{eqnarray} \nonumber \Sigma = \left( \begin{array}{ccc} 0.3 & 0.1 & 0.1 \\ 0.1 & 0.3 & -0.1 \\ 0.1 & -0.1 & 0.3 \end{array} \right) \end{eqnarray}
\noindent and the means of the distributions are $(0, 0, 0)^T$ and $(1, 1, 1)^T$. Let us take a look at what our discriminant function
\begin{eqnarray}\label{eq:2-d-gaussian} g_i(\mbox{\bf x}) &=& - \frac{1}{2} (\mbox{\bf x}- \boldsymbol{\mu}_i)^T \Sigma_i^{-1}(\mbox{\bf x}- \boldsymbol{\mu}_i) \end{eqnarray}
\noindent leads to. \\
\noindent Since (see APPENDIX III) the inverse of the covariance matrix is
\begin{eqnarray} \nonumber \Sigma^{-1} = \frac{1}{2} \left( \begin{array}{ccc} 10 & -5 & -5 \\ -5 & 10 & 5 \\ -5 & 5 & 10 \end{array} \right) \end{eqnarray}
\noindent our discriminant equation $g_1(\mbox{\bf x}) = g_2(\mbox{\bf x})$ becomes (dropping the factors $-\frac{1}{2}$ and $\frac{1}{2}$ common to both sides)
\begin{eqnarray} \nonumber (x_1 \hspace{.1in} x_2 \hspace{.1in} x_3) \left( \begin{array}{ccc} 10 & -5 & -5 \\ -5 & 10 & 5 \\ -5 & 5 & 10 \end{array} \right) \left( \begin{array}{c} x_1 \\ x_2 \\ x_3 \end{array} \right) = \hspace{1.5in} \end{eqnarray}
\begin{eqnarray} \nonumber \hspace*{1.5in} ((x_1-1) \hspace{.1in} (x_2-1) \hspace{.1in} (x_3-1)) \left( \begin{array}{ccc} 10 & -5 & -5 \\ -5 & 10 & 5 \\ -5 & 5 & 10 \end{array} \right) \left( \begin{array}{c} x_1-1 \\ x_2-1 \\ x_3-1 \end{array} \right) \end{eqnarray}
\noindent Further calculation leads to
\begin{eqnarray} \nonumber ((10x_1-5x_2-5x_3) \hspace{.1in} (-5x_1+10x_2+5x_3) \hspace{.1in} (-5x_1+5x_2+10x_3)) \left( \begin{array}{c} x_1 \\ x_2 \\ x_3 \end{array} \right)= \end{eqnarray}
\begin{eqnarray} \nonumber ((10x_1-5x_2-5x_3) \hspace{.1in} (-5x_1+10x_2+5x_3-10) \hspace{.1in} (-5x_1+5x_2+10x_3-10)) \left( \begin{array}{c} x_1-1 \\ x_2-1 \\ x_3-1 \end{array} \right) \end{eqnarray}
\noindent All the second-order terms cancel and we obtain
\begin{displaymath} x_2 + x_3 = 1. \end{displaymath}
We now see that the border which discriminates the two classes $\omega_1$ and $\omega_2$ is this plane (note that it passes through the midpoint $(\frac{1}{2},\frac{1}{2},\frac{1}{2})$ of the two means).
\subsection{A Higher-dimensional Gaussian case}
The equation
\begin{eqnarray} \nonumber g_i(\mbox{\bf x}) &=& \ln p(\mbox{\bf x} \vert \omega_i) + \ln P(\omega_i) \end{eqnarray}
still holds, of course. Now let's recall that the Gaussian pdf is
\begin{equation} p(\mbox{\bf x} \vert \omega) = \frac{1}{{(2\pi)}^{d/2} {\vert {\Sigma} \vert}^{1/2}} \exp \{-\frac{1}{2}(\mbox{\bf x}-\boldsymbol{\mu})^t \Sigma^{-1}(\mbox{\bf x}-\boldsymbol{\mu})\} \end{equation}
\noindent and as such
\begin{equation} \label{eq:discriminant-fn} g_i(\mbox{\bf x}) = -\frac{1}{2}(\mbox{\bf x}- \boldsymbol{\mu}_i)^t \Sigma_i^{-1}(\mbox{\bf x}-\boldsymbol{\mu}_i) - \frac{d}{2}\ln(2\pi)-\frac{1}{2}\ln \vert \Sigma_i \vert + \ln P(\omega_i) \end{equation}
\noindent We now take a look at cases which simplify the situation more or less.\\ \bigskip
\subsubsection*{$\bullet$ When $\Sigma_i = \sigma^2 I$}
\noindent In this case it is easy to guess that the samples of each class fall within hyperspheres of equal size. Note, first of all, that $\vert \Sigma_i \vert = \sigma^{2d}$ and $\Sigma_i^{-1} = (1/\sigma^2)I$.
So $g_i(\mbox{\bf x})$ here reduces to
\begin{equation} g_i(\mbox{\bf x}) = - \frac{\| \mbox{\bf x} - \boldsymbol{\mu}_i \| ^2}{2\sigma^2} + \ln P(\omega_i) \end{equation}
or, equivalently,
\begin{equation} \label{eq:discrimitnant} g_i(\mbox{\bf x}) = - \frac{1}{2\sigma^2} ({\mbox{\bf x}}^t {\mbox{\bf x}} - 2\boldsymbol{\mu}_i^t {\mbox{\bf x}} + \boldsymbol{\mu}_i^t\boldsymbol{\mu}_i) + \ln P(\omega_i) \end{equation}
Neglecting the term ${\mbox{\bf x}}^t {\mbox{\bf x}}$, which is common to all classes and therefore does not affect the relation $g_i(\mbox{\bf x})-g_j(\mbox{\bf x})= 0$, our $g_i(\mbox{\bf x})$ is now
\begin{equation} g_i(\mbox{\bf x}) = \frac{1}{\sigma^2} \boldsymbol{\mu}_i^t {\mbox{\bf x}} - \frac{1}{2\sigma^2} \boldsymbol{\mu}_i^t\boldsymbol{\mu}_i + \ln P(\omega_i) \end{equation}
Then $g_i(\mbox{\bf x})-g_j(\mbox{\bf x})=0$ leads to
\begin{equation} \frac{1}{\sigma^2} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^t {\mbox{\bf x}} - \frac{1}{2\sigma^2} (\| \boldsymbol{\mu}_i \|^2 - \| \boldsymbol{\mu}_j \|^2) + \ln \frac{P(\omega_i)}{P(\omega_j)} = 0 \end{equation}
\noindent In this case (when the priors $P(\omega_i)$ are all equal) our classification rule will be\\ \bigskip
\begin{formula}[Minimum Distance Classification] Measure the Euclidean distance $\| \mbox{\bf x} - \boldsymbol{\mu}_i \|$ for every $i$, then classify $\mbox{\bf x}$ to the class whose mean is nearest to $\mbox{\bf x}$. \end{formula}
\noindent If we carefully rearrange Eq. (\ref{eq:discrimitnant}) we will obtain
\begin{equation}\label{eq:100} \mbox{\bf w}\cdot(\mbox{\bf x} - \mbox{\bf x}_0) = 0 \end{equation}
\bigskip
\begin{excersize} Derive Eq. (\ref{eq:100}), specifying $\mbox{\bf w}$ and ${\mbox{\bf x}}_0$. \end{excersize}
\bigskip
\noindent Eq. (\ref{eq:100}) can be interpreted as
\begin{center}{\it ``a hyperplane through ${\mbox{\bf x}}_0$ perpendicular to $\mbox{\bf w}$.'' }\end{center}
\subsubsection*{$\bullet$ When all $\Sigma_i = \Sigma$}
\noindent This condition implies that the patterns in each class are distributed in hyper-ellipsoids of the same shape and orientation. Now our discriminant function is
\begin{displaymath} g_i(\mbox{\bf x}) = -\frac{1}{2}(\mbox{\bf x}-\boldsymbol{\mu}_i)^t \Sigma^{-1}(\mbox{\bf x}-\boldsymbol{\mu}_i) + \ln P(\omega_i) \end{displaymath}
\noindent We again obtain
\begin{displaymath} \mbox{\bf w}\cdot(\mbox{\bf x} - \mbox{\bf x}_0) = 0 \end{displaymath}
\noindent where $\mbox{\bf w}$ and $\mbox{\bf x}_0$ are
\begin{equation} \mbox{\bf w} = \Sigma^{-1}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j) \end{equation}
and
\begin{equation} \mbox{\bf x}_0 = \frac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\ln P(\omega_i)/P(\omega_j)} {(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^t \Sigma^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j) \end{equation}
\noindent Notice that $\mbox{\bf w} = \Sigma^{-1}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$ is in general no longer parallel to $\boldsymbol{\mu}_i - \boldsymbol{\mu}_j$, so the separating hyperplane is not necessarily perpendicular to the line joining the two means. \bigskip
\begin{excersize} Derive $\mbox{\bf w}$ and ${\mbox{\bf x}}_0$ above. \end{excersize}
\bigskip
\noindent So (again assuming equal priors) we modify the above rule to \bigskip
\begin{formula}[Classification by Mahalanobis distance] Assign {\bf x} to the class $\omega_i$ for which the Mahalanobis distance from $\boldsymbol{\mu}_i$ is minimum over all $i$. \end{formula}
\bigskip
\noindent Yes! The (squared) {\it Mahalanobis distance} between {\bf a} and {\bf b} is defined as
\begin{equation} (\mbox{\bf a} - \mbox{\bf b})^t \Sigma^{-1} (\mbox{\bf a}- \mbox{\bf b}) \end{equation}
\noindent A final example for this case is a more general 2-dimensional one, (artificially) devised so that the calculations do not become too complicated.
We now assume $\boldsymbol{\mu}_1 = (0,0)$ and $\boldsymbol{\mu}_2 = (1,0)$, and both classes share the same $\Sigma$:
\begin{eqnarray} \nonumber (7) \hspace{.2in}\Sigma_1 = \left( \begin{array}{cc} 1.1 & 0.3 \\ 0.3 & 1.9 \end{array} \right), \hspace{.5in}\Sigma_2 = \left( \begin{array}{cc} 1.1 & 0.3 \\ 0.3 & 1.9 \end{array} \right) \end{eqnarray}
\bigskip
\subsubsection*{$\bullet$ When all $\Sigma_i$'s are arbitrary}
\noindent When there is no such restriction to simplify the situation, the discriminant function is
\begin{displaymath} g_i(\mbox{\bf x}) = -\frac{1}{2}(\mbox{\bf x}-\boldsymbol{\mu}_i)^t \Sigma_i^{-1}(\mbox{\bf x}-\boldsymbol{\mu}_i) -\frac{d}{2} \ln 2\pi - \frac{1}{2} \ln \vert \Sigma_i \vert + \ln P(\omega_i) \end{displaymath}
The only term we can neglect now is $(d/2) \ln 2\pi$. We apply the identity (valid for symmetric $A$)
\begin{displaymath} (\mbox{\bf x} - \mbox{\bf y})^t A (\mbox{\bf x} - \mbox{\bf y}) = \mbox{\bf x}^t A \mbox{\bf x} - 2 (A \mbox{\bf y})^t\mbox{\bf x} + \mbox{\bf y}^t A \mbox{\bf y}. \end{displaymath}
Then we get the following rewritten discriminant function
\begin{equation} g_i(\mbox{\bf x}) = \mbox{\bf x}^t W_i \mbox{\bf x} + \mbox{\bf w}_i^t \mbox{\bf x} + w_{i0} \end{equation}
where
\begin{displaymath} W_i = - \frac{1}{2} \Sigma_i^{-1} \end{displaymath}
\begin{displaymath} \mbox{\bf w}_i = \Sigma_i^{-1}\boldsymbol{\mu}_i \end{displaymath}
\begin{displaymath} w_{i0} = -\frac{1}{2} \boldsymbol{\mu}_i^t \Sigma_i^{-1} \boldsymbol{\mu}_i - \frac{1}{2} \ln \vert \Sigma_i \vert + \ln P(\omega_i). \end{displaymath}
\bigskip
\noindent Hence, $g_i(\mbox{\bf x}) - g_j(\mbox{\bf x}) = 0$ is a {\it hyperquadratic} equation. In special cases it can even factor into a product of linear forms,
\begin{displaymath} (a_1 x_1 + a_2 x_2 + \cdots + a_n x_n)(b_1 x_1 + b_2 x_2 + \cdots + b_n x_n) = const. \end{displaymath}
\bigskip
\noindent In general, the border is one of: (i) a hyperplane; (ii) a pair of hyperplanes; (iii) a hypersphere; (iv) a hyper-ellipsoid; (v) a hyper-paraboloid; (vi) a hyper-hyperboloid.
\subsubsection{2-D cases revisited}
\clearpage
\section{Parameter Estimation}
Our goal is to know $p(\omega_i \vert \mbox{\bf x})$.
\noindent Assume now that, for some reason, we believe we know the form of the density function; what we don't know are the parameters of the density function.\\
\noindent To get a bird's-eye view, here are a few one-dimensional examples, besides the Gaussian, of density functions that are specified by a parameter $\theta$. (A short code sketch evaluating some of them follows the list.)\\
\begin{itemize} \setlength{\itemsep}{0pt}
\item Uniform
\begin{itemize} \item $p(x) = 1/\theta $ \ \ if $0 < x \le \theta$, otherwise 0 \end{itemize}
\item Exponential
\begin{itemize} \item $f(x) = \theta \exp(-\theta x)$ \ \ if $x>0$, otherwise 0 \end{itemize}
\item Rayleigh (see the figure below)
\begin{itemize} \item $f(x) = 2\theta x \exp(-\theta x^2)$ \ \ if $x>0$, otherwise 0 \end{itemize}
\item Poisson
\begin{itemize} \item $f(x) = (\theta^x/x!)\exp(- \theta)$ \ \ for $x=0, 1, 2, \cdots$ \end{itemize}
\item Bernoulli
\begin{itemize} \item $f(x) = \theta^x (1 - \theta)^{1-x}$ \ \ for $x=0,1$ \end{itemize}
\item Binomial
\begin{itemize} \item $f(x) = \frac{m!}{x!(m-x)!} \theta^x (1-\theta)^{m-x}$ \ \ for $x = 0, 1, \cdots, m$ \end{itemize}
\end{itemize}
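\noindent As a companion to the list above, here is a small sketch (assuming Python with NumPy; the particular $\theta$ values are arbitrary illustrations) that evaluates two of these densities, including the Rayleigh density used repeatedly below:
\begin{verbatim}
import numpy as np

def rayleigh_pdf(x, theta):
    """f(x) = 2*theta*x*exp(-theta*x^2) for x > 0, else 0."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, 2.0 * theta * x * np.exp(-theta * x * x), 0.0)

def exponential_pdf(x, theta):
    """f(x) = theta*exp(-theta*x) for x > 0, else 0."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, theta * np.exp(-theta * x), 0.0)

# Evaluate the Rayleigh density on a grid for a few arbitrary theta values,
# in the spirit of the curves shown in the figure below.
xs = np.linspace(0.0, 10.0, 101)
for theta in (0.05, 0.1, 0.2, 0.5):
    print(theta, rayleigh_pdf(xs, theta).max().round(3))
\end{verbatim}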
\begin{figure}[bhtp] \begin{center}
\includegraphics[width=5cm,height=5cm]{rayleigh.eps}
\end{center}
\caption{The Rayleigh density for four different values of $\theta$.}
\end{figure}
\noindent All we don't know are the parameters that specify the function. We may, however, have prior information $p(\theta)$, and the training samples give us $p(\theta \vert D)$, which we expect to have a sharp peak at the true $\theta$. \\
\subsection{Maximum Likelihood Estimation}
Before entering this topic, try the following case, in which we have two classes, each with only 4 training samples. \bigskip
\begin{example} Assume a 2-dimensional feature space, and suppose we have 4 samples from each of two classes. The patterns we have are $(8,3), (4,3), (6,2), (6,4)$ from $\omega_1$ and $(0,3), (-2,1), (-4,3), (-2,5)$ from $\omega_2$. Try to guess the border between the two classes, and then classify a new point, $(5,3)$ for example. \end{example} \bigskip
\noindent We now simplify the situation even further.\\ \bigskip
\begin{example} Our data are now $x_1 = 3$, $x_2 = 8$, $x_3 = 2$ and $x_4 = 5$. Guess what distribution these 4 data follow. \end{example} \bigskip
\noindent These might come from a Gaussian distribution, but a brief look suggests something more like a Rayleigh distribution. \bigskip
\begin{figure}[bhtp] \begin{center}
\includegraphics[width=5cm,height=5cm]{rayleigh-2.eps}
\includegraphics[width=5cm,height=5cm]{rayleigh-3.eps}
\end{center}
\caption{A cloud of 200 points, each set drawn from a different $N[\mu, \Sigma]$.}
\end{figure}
\bigskip
\begin{excersize} (1) Assume now a Gaussian with $\sigma = 1$ and unknown $\mu$. Estimate $\mu$ from the same 4 data, i.e. $x_1 = 3$, $x_2 = 8$, $x_3 = 2$ and $x_4 = 5$, in the same way as above. (2) Next, create 10, 20, and 100 data randomly from $N(1, 3)$, and plot $p(D \vert \mu)$ as a function of $\mu$ in each of the three cases. \end{excersize}
\begin{formula}[An assumption of independence of the data] Take the $\theta$ which maximizes
\begin{equation} P(D \vert \theta) = \prod_{k=1}^{n} p(x_k \vert \theta) \end{equation}
\end{formula}
\noindent In short, choose the $\theta$ that maximizes the likelihood of the whole data set.\\
\noindent Now let's carry out this calculation for our previous example of the four one-dimensional data $D = \{2, 3, 5, 8\}$, changing $\theta$ from 0.00 to 1.00 with an interval of 0.01. The results are shown in the left-most graph of Fig.~\ref{fig:ML}. It is also interesting to see what happens if we have more data: the same procedure is carried out with 10, 20, and 100 data points, and the results are shown in the same figure.
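\noindent A minimal sketch of this grid computation (assuming Python with NumPy; the Rayleigh form $p(x \vert \theta) = 2\theta x e^{-\theta x^2}$ is the one from the list of densities above):
\begin{verbatim}
import numpy as np

def rayleigh_likelihood(data, theta):
    """P(D | theta) = prod_k 2*theta*x_k*exp(-theta*x_k^2)."""
    data = np.asarray(data, dtype=float)
    return np.prod(2.0 * theta * data * np.exp(-theta * data ** 2))

D = [2.0, 3.0, 5.0, 8.0]                      # the four training samples
thetas = np.arange(0.0, 1.01, 0.01)           # grid of candidate thetas
likelihoods = np.array([rayleigh_likelihood(D, t) for t in thetas])

best = thetas[np.argmax(likelihoods)]
print("maximum-likelihood theta on the grid:", best)
# For the Rayleigh model the maximum is also available in closed form,
# theta_hat = n / sum(x_k^2), which the grid search should approximate.
print("closed form:", len(D) / sum(x * x for x in D))
\end{verbatim}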
The important thing is:
\begin{center}{\it ``The more data we have, the sharper the graph becomes.'' } \end{center}
\begin{figure}[bhtp] \begin{center}
\includegraphics[width=3cm,height=3cm]{ml-rayleigh.eps}
\includegraphics[width=3cm,height=3cm]{ml-rayleigh-10.eps}
\includegraphics[width=3cm,height=3cm]{ml-rayleigh-20.eps}
\includegraphics[width=3cm,height=3cm]{ml-rayleigh-100.eps}
\end{center}
\caption{The likelihood $P(D \vert \theta)$ for the Rayleigh model as a function of $\theta$, computed from 4, 10, 20, and 100 data points.}
\label{fig:ML}
\end{figure}
\begin{figure}[bhtp] \begin{center}
\includegraphics[width=10cm,height=5cm]{gaussian.eps}
\end{center}
\caption{A cloud of 200 points, each set drawn from a different $N[\mu, \Sigma]$.}
\end{figure}
\begin{figure}[bhtp] \begin{center}
\includegraphics[width=10cm,height=5cm]{uniform.eps}
\end{center}
\caption{A cloud of 200 points, each set drawn from a different $N[\mu, \Sigma]$.}
\end{figure}
\subsection{Bayesian parameter estimation}
Here we view the parameter $\boldsymbol{\theta}$ as a random variable with a prior distribution $p(\boldsymbol{\theta})$, and we view $p(\boldsymbol{\theta} \vert D)$ as a function of $\boldsymbol{\theta}$. As before, we assume
\begin{equation} P(D \vert \theta) = \prod_{k=1}^{n} p(x_k \vert \theta) \end{equation}
\noindent where we denote the training data
\begin{displaymath} D_n = \{x_1, x_2, \cdots, x_n\} \end{displaymath}
\noindent which implies
\begin{displaymath} P(D_n \vert \theta) = P(x_n \vert \theta) P(D_{n-1} \vert \theta) \end{displaymath}
\noindent Therefore the Bayesian formula
\begin{equation} p(\theta \vert D) \propto P(D \vert \theta) P(\theta) \end{equation}
leads us to the recursion
\begin{eqnarray} \nonumber P(\theta \vert D_n) &\propto& P(D_n \vert \theta)P(\theta) \\ \nonumber &=& P(x_n \vert \theta) P(D_{n-1} \vert \theta)P(\theta) \\ \nonumber &\propto& P(x_n \vert \theta) P(\theta \vert D_{n-1}) \end{eqnarray}
\clearpage
\section{Non-parametric density estimation}
Our assumptions so far have been:
\begin{itemize}
\item [(1)] the density function $p(\mbox{\bf x} \vert \omega)$ is of a known form;
\item [(2)] a high-dimensional density function can be represented as a product of one-dimensional functions.
\end{itemize}
\noindent Now let's make a more realistic assumption: ``we don't know the distribution of the samples, i.e., we don't know the form of the density function.'' Let's estimate $p(\mbox{\bf x} \vert \omega)$ from our sample patterns, or estimate $P(\omega \vert \mbox{\bf x})$ directly from them. The typical example of the latter is the {\it nearest-neighbor rule}.
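\noindent As a preview of such a direct approach, here is a minimal sketch of a one-nearest-neighbor classifier (assuming Python with NumPy, and reusing the eight labelled samples of Example 1):
\begin{verbatim}
import numpy as np

# The labelled training samples of Example 1 (feature vectors and labels).
X = np.array([[8, 3], [4, 3], [6, 2], [6, 4],      # class 1
              [0, 3], [-2, 1], [-4, 3], [-2, 5]])  # class 2
y = np.array([1, 1, 1, 1, 2, 2, 2, 2])

def nearest_neighbor(x):
    """Assign x the label of the closest training sample (1-NN rule)."""
    distances = np.linalg.norm(X - np.asarray(x, dtype=float), axis=1)
    return y[np.argmin(distances)]

print(nearest_neighbor([5, 3]))   # the new point from Example 1
\end{verbatim}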
\section{Unknown Density Function Estimation}
One of the most elementary formulas in probability theory tells us that the probability that $\mbox{\bf x}$ falls in a region $R$ is
\begin{equation}\label{eq:elementary-prob} P= \int_{R} p(\mbox{\bf x}) d \mbox{\bf x} \end{equation}
Recall that, in the 1-dimensional case, the probability that $x$ falls in the interval $[x_1, x_2]$ is
\begin{displaymath} P= \int_{x_1}^{x_2} p(x) dx \end{displaymath}
\noindent The probability that exactly $k$ out of the $n$ samples $\{x_1, x_2, \cdots, x_n\}$ fall in the region $R$ is
\begin{equation}\label{eq:binominal} \binom{n}{k} P^k(1-P)^{n-k} \end{equation}
\noindent which peaks sharply around $k=nP$, so we can assume that
\begin{equation} k \simeq nP \end{equation}
holds. On the other hand, Eq. (\ref{eq:elementary-prob}) can be approximated as
\begin{equation} P \simeq p(\mbox{\bf x}) V \end{equation}
where $V$ is the volume of the region $R$, on condition that $V$ is small enough. In the 1-dimensional case this is
\begin{displaymath} p(x)\cdot (x_2-x_1) \end{displaymath}
on condition that $x_2-x_1 \approx 0$. Hence we now have the relation
\begin{equation} p(\mbox{\bf x}) \simeq \frac{k/n}{V}. \end{equation}
\bigskip
\begin{excersize} Calculate Eq. (\ref{eq:binominal}) above for, say, $n=4, 20, 100$, and plot the probability as a function of $k/n$, assuming the true $P$ is, say, 0.6. Confirm that the larger the value of $n$, the sharper the peak. \end{excersize}
\noindent Now, in order to estimate the density $p(\mbox{\bf x})$ at {\bf x}, let's create a sequence of regions $R_1, R_2, R_3, \cdots$, where $R_n$ is the region used when the estimate is made with $n$ samples. Our fundamental equation becomes
\begin{equation} p_n(\mbox{\bf x}) = \frac{k_n/n}{V_n} \end{equation}
\noindent The conditions for this to converge are
\begin{equation} \lim_{n \to \infty} k_n = \infty \end{equation}
\begin{equation} \lim_{n \to \infty} V_n = 0 \end{equation}
\begin{equation} \lim_{n \to \infty} k_n/n = 0 \end{equation}
\noindent To realize these conditions we have basically two approaches. In one approach we shrink $V_n$ as $n$, the number of training samples, increases, for example according to $V_n = V_0/\sqrt{n}$. In the other approach we specify $k_n$ as a function of $n$, for example $k_n = k_0 \sqrt{n}$, and grow the region around $\mbox{\bf x}$ until it captures $k_n$ samples.
\noindent The former is called the {\it Parzen-window} approach and the latter the {\it $k_n$-nearest-neighbor} approach, both of which we are going to explore in the next two subsections.
\subsection{Parzen-Window classifier}
First of all, we define a new window function
\begin{eqnarray} \varphi (\mbox{\bf u}) = \left\{ \begin{array}{ll} 1 & \hspace{.2in}\mbox{if}\hspace{.2in} \vert u_j \vert \le \frac{1}{2} \hspace{.1in} \mbox{for every component } j\\ 0 & \hspace{.2in}\mbox{otherwise}\\ \end{array} \right. \end{eqnarray}
\noindent Then we get
\begin{eqnarray} \varphi \left(\frac{\mbox{\bf x} - \mbox{\bf x}_i}{h_n}\right) = \left\{ \begin{array}{ll} 1 & \hspace{.2in}\mbox{if}\hspace{.2in} \mbox{\bf x}_i \hspace{.1in} \mbox{falls within the hypercube}\\ 0 & \hspace{.2in}\mbox{otherwise}\\ \end{array} \right. \end{eqnarray}
where the hypercube is located with its center at {\bf x} and all of its edges of length $h_n$. Here we can write $k_n$ as
\begin{equation} k_n = \sum_{i=1}^n \varphi \left(\frac{\mbox{\bf x} - \mbox{\bf x}_i}{h_n}\right) \end{equation}
\noindent Note that at this moment $k_n$ is just a count of samples; to turn it into a density we have to normalize it. Recalling that our goal is to obtain the density $p(\mbox{\bf x})$ at {\bf x}, we are happy to obtain the following equation.
\begin{equation} p_n(\mbox{\bf x}) = \frac{1}{n} \sum_{i=1}^n {\frac{1}{V_n} \varphi \left(\frac{\mbox{\bf x} - \mbox{\bf x}_i}{h_n}\right)} \end{equation}
\bigskip
\begin{excersize} Assuming $p(x) = U(0,1)$, try to find what happens if we use $\pi_n(x)$ as the Parzen window function. Draw $p_n(x)$ in each case of $h_n=1, 1/4, 1/16$. \end{excersize}
\subsubsection{A Neural Network Implementation}
\begin{figure}[bhtp] \begin{center}
\includegraphics[width=8cm,height=5cm]{nn-for-bayes.eps}
\end{center}
\caption{A probabilistic-neural-network implementation of the Parzen-window approach.}
\end{figure}
\begin{figure}[bhtp] \begin{center}
\includegraphics[width=8cm,height=5cm]{nn-for-bayes-2.eps}
\end{center}
\caption{A probabilistic-neural-network implementation of the Parzen-window approach (continued).}
\end{figure}
\subsection{Nearest-Neighbor classifier}
\section{Bayesian Belief Network}
We now assume we have $N$ variables, as in our previous example of salmon:
\begin{center}{\it Salmon \{female, male\} \\ Season \{spring, summer, autumn, winter\}\\ Location \{sea, river, lake\} \\ Length \{short, medium, long\} }\end{center}
\noindent Or, in more general form,
\begin{displaymath} A(a_1, a_2, \cdots) \end{displaymath}
\begin{displaymath} B(b_1, b_2, \cdots) \end{displaymath}
\begin{displaymath} C(c_1, c_2, \cdots) \end{displaymath}
and we also know, for example,
\begin{displaymath} p(\mbox{\bf a} \vert \mbox{\bf b}) \end{displaymath}
\begin{displaymath} p(\mbox{\bf b} \vert \mbox{\bf c}) \end{displaymath}
\noindent etc. Then we can represent this by a kind of network like:
\bigskip
\begin{figure}[bhtp] \begin{center}
\includegraphics[width=5cm,height=5cm]{blief-net.eps}
\end{center} \end{figure}
\noindent We call this network a {\it Bayesian Belief Network}. The assumption here is that we know all ...
\noindent The aim is
\begin{itemize}
\item to classify a new data point {\bf x};
\item to know the probability of an unknown variable from the other, known variables.
\end{itemize}
\begin{figure}[bhtp] \begin{center}
\includegraphics[width=5cm,height=5cm]{blief-net-2.eps}
\end{center} \end{figure}
\noindent We now take $\mbox{\bf x} = (x_1, x_2, \cdots)$ as a {\it belief} about $x_1, x_2, \cdots$ under the {\it evidences} $\mbox{\bf e} = (e_1, e_2, \cdots)$, where some of the $e_i$ come from {\it parent} variables and the others from {\it children}.
Then the elementary calculation is
\begin{equation} p(\mbox{\bf x} \vert \mbox{\bf e}_{\mbox {\small all}}) \propto p(\mbox{\bf e}_{\mbox {\small children}} \vert \mbox{\bf x}) \cdot p(\mbox{\bf x} \vert \mbox{\bf e}_{\mbox {\small parents}}) \end{equation}
where
\begin{equation} p(\mbox{\bf e}_{\mbox {\small children}} \vert \mbox{\bf x}) = p(e_{c_1} \cdots e_{c_n} \vert \mbox{\bf x}) = p(e_{c_1} \vert \mbox{\bf x}) \cdots p(e_{c_n} \vert \mbox{\bf x}) = \prod_{i} p(e_{c_i} \vert \mbox{\bf x}) \end{equation}
and
\begin{equation} p(\mbox{\bf x} \vert \mbox{\bf e}_{\mbox {\small parents}}) = p(\mbox{\bf x} \vert e_{p_1}, e_{p_2}, \cdots) = \sum_{p_1, p_2, \cdots} p(\mbox{\bf x} \vert p_1, p_2, \cdots) p(p_1, p_2, \cdots \vert e_{p_1}, e_{p_2}, \cdots) \end{equation}
where
\begin{equation} p(p_1, p_2, \cdots \vert e_{p_1}, e_{p_2}, \cdots) = p(p_1 \vert e_{p_1}) \cdot p(p_2 \vert e_{p_2}) \cdots \end{equation}
For example,
\begin{displaymath} p(x_1) = p(x_1 \vert a_1 b_1)p(a_1)p(b_1) + \cdots \end{displaymath}
\noindent In our salmon example, assume we know all the conditional probabilities, such as
\begin{eqnarray} \nonumber \begin{array}{ccc} \hspace*{.1in} & \mbox{female} & \mbox{male} \\ \mbox{winter} & 0.9 & 0.1 \\ \mbox{spring} & 0.3 & 0.7 \\ \mbox{summer} & 0.4 & 0.6 \\ \mbox{autumn} & 0.8 & 0.2 \end{array} \end{eqnarray}
\begin{example} \hspace*{1in}\\
\begin{itemize} \setlength{\itemsep}{0pt}
\item[(1)] A new salmon was caught in {\it winter}, at the {\it lake}. The size is {\it long} and it is very {\it thin}. What is the probability that it is a female salmon?
\item[(2)] The new salmon we have is {\it short}, {\it thick}, and caught in a {\it river}. What is the most likely season in which it was caught?
\item[(3)] We have a {\it long}, {\it thin} salmon caught in {\it winter}. What is the probability that it is from the sea?
\item[(4)] It was caught between {\it autumn} and {\it winter} in a {\it lake}; we have no information about its length, but it is of {\it medium thickness}. Is the salmon male?
\item[(5)] If it is {\it short} and {\it thick}, in which season do you guess the salmon was caught?
\item[(6)] If it is {\it long} and {\it thick} and from a {\it lake}, which season is most likely?
\end{itemize}
\end{example}
\section{Hidden Markov Model}
\subsection{Markov Model}
We now assume that a pattern is a sequence of states $x(t)$, something like
\begin{displaymath} {x(1), x(2), x(3), \cdots, x(T)} \end{displaymath}
Here we assume time is discrete: $1,2,3, \cdots, T$. Let's denote a sequence of states of length $T$ by $\mbox{\bf x}^T$. Imagine
\begin{displaymath} \mbox{\bf x}^7 = {x_1, x_6, x_3, x_3, x_1, x_2, x_4} \end{displaymath}
An example of a state is the weather: think of $x_1$ as a fine day, $x_2$ as a cloudy day, $x_3$ as a rainy day, and so on. Assume we know the transition probability from any state to any state,
\begin{equation} P(x_j(t+1) \vert x_i(t)) = a_{ij}. \end{equation}
We call the full set of these $a_{ij}$ a {\it model}. When we denote this full set of $a_{ij}$ by $\boldsymbol{\theta}$, the probability of the above example sequence $\mbox{\bf x}^7$ under the model $\boldsymbol{\theta}$ is
\begin{equation} P(\mbox{\bf x}^7 \vert \boldsymbol{\theta}) = a_{16} \cdot a_{63} \cdot a_{33} \cdot a_{31} \cdot a_{12} \cdot a_{24}. \end{equation}
\noindent Our goal here is to find the model which best explains the training patterns of a known class; later, a test pattern is classified by the model that has the highest posterior probability $P(\omega \vert \mbox{\bf x})$.
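\noindent A minimal sketch of this sequence-probability calculation (assuming Python with NumPy; the transition matrix is randomly generated for illustration, not taken from any real data):
\begin{verbatim}
import numpy as np

# An arbitrary 6-state transition matrix: A[i, j] = P(x_j at t+1 | x_i at t),
# generated at random just for illustration; each row is normalized to 1.
rng = np.random.default_rng(0)
A = rng.random((6, 6))
A = A / A.sum(axis=1, keepdims=True)

def sequence_probability(states, A):
    """P(x^T | model): product of the transition probabilities a_{ij}
    along the sequence (the initial state is taken as given)."""
    p = 1.0
    for s, s_next in zip(states[:-1], states[1:]):
        p *= A[s, s_next]
    return p

# x^7 = {x1, x6, x3, x3, x1, x2, x4}, written with 0-based indices.
seq = [0, 5, 2, 2, 0, 1, 3]
print(sequence_probability(seq, A))   # = a16 * a63 * a33 * a31 * a12 * a24
\end{verbatim}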
\subsection{Hidden Markov Model}
Assume now that the state $x(t)$ is not observable, but emits a visible symbol $y(t)$. For example,
\begin{equation} {y_5, y_3, y_2, y_2, y_7, y_2, y_3} \end{equation}
Assume we also know
\begin{equation} P(y_k(t) \vert x_j(t)) = b_{jk} \end{equation}
where $\sum_j a_{ij} = 1$ for all $i$ and $\sum_k b_{jk} = 1$ for all $j$. This is called a {\it Hidden Markov Model} because the sequence of states $x(t)$ is hidden from the observer.\\
\noindent Then we have three different problems.
\begin{itemize}
\item Evaluation problem
\begin{itemize} \item Evaluation of the probability of a particular sequence of visible symbols. \end{itemize}
\item Decoding problem
\begin{itemize} \item Assuming we know all of the $a_{ij}$ and $b_{jk}$, determination of the most likely sequence of hidden states for a given observed visible sequence. \end{itemize}
\item Learning problem
\begin{itemize} \item Estimation of the $a_{ij}$ and $b_{jk}$ from a given set of training observations of visible sequences. \end{itemize}
\end{itemize}
\section{Fuzzy Classification}
Degree of belongingness $\rightarrow$ degree of membership, ranging from 0 to 1.\\
\noindent Examples of membership functions:
\begin{figure}[bhtp] \begin{center} \end{center}
\caption{Trapezoid on a corner, trapezoid, triangle, singleton, triangle on a corner.}
\end{figure}
\noindent Do you like cold beer?
\begin{figure}[bhtp] \begin{center} \end{center}
\caption{Very cold, cold, not-so-cold, warm.}
\end{figure}
\noindent Yet another form of membership function is the Cauchy (bell-like) function
\begin{equation} \mu(x) = \frac {1}{1 + (x-12)^2} \end{equation}
which represents ``all real numbers close to 12''.
\begin{figure}[bhtp] \begin{center} \end{center}
\caption{All real numbers close to 12.}
\end{figure}
A Gaussian membership function is also often used.
\subsection{Basic Arithmetic}
The basic operations are the union $A \cup B$, the intersection $A \cap B$, and the complement $\bar{A}$.
\subsubsection{When two membership functions have the same domain}
The common domain is sometimes called the {\it universe of discourse}. As an example (compare with traditional set theory), consider the fuzzy sets ``close to 5'' and ``close to 7'' and form
\begin{itemize} \setlength{\itemsep}{0pt}
\item[1)] close to 5 {\bf and} close to 7;
\item[2)] close to 5 {\bf or} close to 7;
\item[3)] {\bf not} close to 5.
\end{itemize}
\subsection{When $x$ and $y$ are defined on different domains}
Consider ``young'' and ``tall'', given a membership function for each (say over 15--35 years and 170--190 cm). How can we represent the combined relation?
\begin{itemize} \setlength{\itemsep}{0pt}
\item[1)] as a 3-D surface;
\item[2)] as a gray-scale image;
\item[3)] by discretizing both domains, which yields a matrix (e.g. rows for 15, 20, 25, 30, 35 years and columns for 175, 180, 185, 190 cm).
\end{itemize}
\noindent Common definitions of ``$A$ and $B$'' and ``$A$ or $B$'':
\begin{center}
\begin{tabular}{lcc}
 & $A$ and $B$ & $A$ or $B$ \\ \hline
Standard & $\min(\mu_A, \mu_B)$ & $\max(\mu_A, \mu_B)$ \\
Algebraic & $\mu_A \mu_B$ & $\mu_A + \mu_B - \mu_A \mu_B$ \\
Lukasiewicz & $\max(0, \mu_A + \mu_B - 1)$ & $\min(1, \mu_A + \mu_B)$ \\
Einstein & $\mu_A \mu_B/(2-(\mu_A+\mu_B)+\mu_A \mu_B)$ & $(\mu_A + \mu_B)/(1+ \mu_A \mu_B)$
\end{tabular}
\end{center}
\begin{excersize} Given $\mu_A$ and $\mu_B$ as ..., compute the combinations above. \end{excersize}
\subsection{IF-THEN rules}
Mamdani's definition: $\mu_{A \rightarrow B} = \min(\mu_A, \mu_B)$. Larsen's definition: $\mu_{A \rightarrow B} = \mu_A \mu_B$. This is important because the goal of fuzzy inference, in short, is to obtain $y = f(x)$ from a set of fuzzy rules.
\subsection{Let's visualize IF-THEN rules}
\begin{itemize} \setlength{\itemsep}{0pt}
\item[1)] The simplest case: a single rule such as ``if $x$ = low then $y$ = high'', given membership functions for ``low'' (say over 10--40) and ``high'' (say over 300--400). The rule is represented by $\mu_R(x, y) = \min(\mu_{\rm low}(x), \mu_{\rm high}(y))$.
\item[2)] If the precondition is composite, like ``IF $x_1$ = low AND $x_2$ = medium THEN $y$ = high'', then $\mu_R(x_1, x_2, y) = \min(\mu_{\rm low}(x_1), \mu_{\rm medium}(x_2), \mu_{\rm high}(y))$ is defined on the 3-D space $(x_1, x_2, y)$ and hence cannot be drawn directly. Only when we give $x_1$ and $x_2$ specific values can $\mu(x_1, x_2, y)$ be plotted against $y$ on a one-dimensional axis.
\item[3)] What about two rules, like R1: ``if $x$ = low then $y$ = high'' and R2: ``if $x$ = medium then $y$ = medium''? Again we plot only after giving $x$ a specific value. Then $\mu_{R1}(y)$ and $\mu_{R2}(y)$ can be seen as before, and this time we further take the MAX of the two, since the rule base reads ``R1 OR R2''. Now we can plot the aggregated $\mu(y)$ on the $y$ axis.
\end{itemize}
\noindent Why does this aggregated $\mu(y)$ make us happy? Because it can be {\it defuzzified}, for example by taking its center of gravity, which gives a crisp output $y$.\\
\noindent An example in a more general setting: let {\it distance} take the values small, $\ldots$, large; {\it my\_speed} the values very low, $\ldots$, very high; and {\it brake} the values weak, $\ldots$, strong, with rules of the form R1: if $x_1 = \ldots$ AND $x_2 = \ldots$ THEN $y = \ldots$, and similarly R2, R3, and so on. Then, if the distance is $x_1 = 30$ m and my speed is 50 km/h, how strong should the braking $y$ be? (A small sketch of this min/max/defuzzify procedure follows.)
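\noindent Here is a minimal sketch of that procedure (assuming Python with NumPy; the shoulder-shaped membership functions, their ranges, and the two rules are made-up illustrations, not values from the lecture):
\begin{verbatim}
import numpy as np

def left_shoulder(x, a, b):
    """Membership 1 below a, falling linearly to 0 at b."""
    x = np.asarray(x, dtype=float)
    return np.clip((b - x) / (b - a), 0.0, 1.0)

def right_shoulder(x, a, b):
    """Membership 0 below a, rising linearly to 1 at b."""
    x = np.asarray(x, dtype=float)
    return np.clip((x - a) / (b - a), 0.0, 1.0)

# Made-up membership functions (illustrative ranges only).
dist_small = lambda d: left_shoulder(d, 10.0, 50.0)     # distance [m]
dist_large = lambda d: right_shoulder(d, 20.0, 100.0)
speed_low  = lambda v: left_shoulder(v, 20.0, 60.0)     # speed [km/h]
speed_high = lambda v: right_shoulder(v, 30.0, 100.0)

y_axis = np.linspace(0.0, 1.0, 101)                     # brake strength
brake_weak   = left_shoulder(y_axis, 0.2, 0.6)
brake_strong = right_shoulder(y_axis, 0.4, 0.8)

def infer(distance, speed):
    """Mamdani-style inference: min for AND, clip each rule's output
    set at the rule's firing strength, aggregate with max, and
    defuzzify by the center of gravity."""
    # R1: IF distance is small AND speed is high THEN brake is strong.
    w1 = min(dist_small(distance), speed_high(speed))
    # R2: IF distance is large AND speed is low THEN brake is weak.
    w2 = min(dist_large(distance), speed_low(speed))
    aggregated = np.maximum(np.minimum(w1, brake_strong),
                            np.minimum(w2, brake_weak))
    return float((aggregated * y_axis).sum() / aggregated.sum())

print("brake strength:", infer(distance=30.0, speed=50.0))
\end{verbatim}
\noindent The min/max choice corresponds to the ``Standard'' row of the table above; swapping in the algebraic product and sum would give another common variant.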
\subsection{Fuzzy Classification}
\subsubsection{Fixed membership}
\subsubsection{Takagi-Sugeno Model}
\clearpage
\section*{APPENDIX}
\subsection*{I. Bayes formula in another scenario}
\begin{itemize}
\item[$\cdot$] Three prisoners ({\bf A}, {\bf B}, and {\bf C}) are in a prison.
\item[$\cdot$] {\bf A} knows that two out of the three are to be executed tomorrow, and the remaining one will be set free.
\item[$\cdot$] {\bf A} thought: at least one of {\bf B} and {\bf C} is sure to be executed.
\item[$\cdot$] Then {\bf A} asked a guard: ``even if you tell me which of {\bf B} and {\bf C} will be executed, that will give me no information about my own fate, so please tell me.''
\item[$\cdot$] The guard answers that {\bf C} will. $\Rightarrow$ data $D$
\item[$\cdot$] Now {\bf A} knows that one of {\bf A} or {\bf B} is sure to be free.
\end{itemize}
\begin{center} \underline{\hspace*{6in}} \end{center}
\noindent Do you guess that the probability $p(A \vert D) = 1/2$?\\
\hspace*{1in}\\
If this were correct, then the guard's answer would indeed have given {\bf A} some information about his own fate, since the prior probability was $p(A)=1/3$.\\
\noindent You agree that
\begin{displaymath} p(A) = p(B) = p(C) = 1/3. \end{displaymath}
Now try to apply the Bayesian rule, i.e., obtain the conditional probability of the data ``C will be executed'' under the condition that ``A will be free tomorrow'', and in the same way for B and C. They are:
\begin{eqnarray} \nonumber p(D \vert A) &=& 1/2. \\ \nonumber p(D \vert B) &=& 1. \\ \nonumber p(D \vert C) &=& 0. \end{eqnarray}
In conclusion:
\begin{displaymath} p(A \vert D)=\frac{p(D \vert A)p(A)} {p(D \vert A)p(A)+p(D \vert B)p(B)+p(D \vert C)p(C)} =1/3. \end{displaymath}
This shows that the probability did not change after the information was given!
\clearpage
\begin{itemize}
\item Now it's clear that
\begin{itemize}
\item $p(A)$ and so on are to be called
\begin{itemize} \item[] $\star$ {\it a priori} probabilities; \end{itemize}
\item and $p(A \vert D)$ and so on
\begin{itemize} \item[] $\star$ {\it a posteriori} probabilities.\\ \end{itemize}
\end{itemize}
\item In the same way,
\begin{itemize}
\item $p(\omega_i)$ is called
\begin{itemize} \item[] $\star$ the {\it a priori} probability\footnote{Usually given, but if unknown it can be estimated as $N_i/N$, where $N_i$ is the number of training samples which belong to class $\omega_i$ and $N$ is the total number of training samples.}; \end{itemize}
\item $p(\omega_i \vert \mbox{\bf x})$
\begin{itemize} \item[] $\star$ the {\it a posteriori} probability.\\ \end{itemize}
\end{itemize}
\item[] Furthermore,
\begin{itemize}
\item $p(\mbox{\bf x})$ is called
\begin{itemize} \item[] $\star$ the p.d.f. of {\bf x}; \end{itemize}
\item $p(\mbox{\bf x} \vert \omega_i)$
\begin{itemize} \item[] $\star$ the class-conditional p.d.f.\footnote{This is also estimated from training data, as will be explained later in more detail.}
\begin{itemize} \item which describes the distribution of the feature vectors {\bf x} in each of the classes $\omega_i$. \end{itemize}
\end{itemize}
\end{itemize}
\end{itemize}
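\noindent A tiny numerical check of the three-prisoners calculation above (a sketch, assuming Python is available):
\begin{verbatim}
# Priors of being set free, and likelihoods of the guard's answer
# D = "C will be executed" under each hypothesis.
prior = {"A": 1/3, "B": 1/3, "C": 1/3}
likelihood = {"A": 1/2, "B": 1.0, "C": 0.0}

evidence = sum(likelihood[h] * prior[h] for h in prior)
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}
print(posterior)   # {'A': 0.333..., 'B': 0.666..., 'C': 0.0}
\end{verbatim}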
\clearpage
\subsection*{II. Quadratic form in 2-dimensional space}
You might be interested, first of all, in how the scatter of the points is influenced by the values in $\Sigma$, that is, $\sigma_1^2$, $\sigma_2^2$, and $\sigma_{12}=\sigma_{21}$. Let's observe three different cases of $\Sigma$ with $\mu=(0, 0)$.
\begin{small}
\begin{eqnarray} \nonumber (1) \hspace{.1in}\Sigma_1 = \left( \begin{array}{cc} 0.20 & 0 \\ 0 & 0.20 \end{array} \right) \hspace{.1in} (2) \hspace{.1in}\Sigma_2 = \left( \begin{array}{cc} 0.1 & 0 \\ 0 & 0.9 \end{array} \right) \hspace{.1in} (3) \hspace{.1in}\Sigma_3 = \left( \begin{array}{cc} 0.5 & 0.3 \\ 0.3 & 0.2 \end{array} \right) \end{eqnarray}
\end{small}
\begin{figure}[bhtp] \begin{center}
\includegraphics[width=5cm,height=5cm]{gaussian-2d-200-points-1.eps}
\includegraphics[width=5cm,height=5cm]{gaussian-2d-200-points-2.eps}
\includegraphics[width=5cm,height=5cm]{gaussian-2d-200-points-3.eps}
\end{center}
\caption{Clouds of 200 points, each drawn from $N[\mu, \Sigma]$ with one of the three different $\Sigma$'s above.}
\end{figure}
\clearpage
\subsection*{III. How to calculate the inverse of a 3-dimensional matrix}
\label{append}
We now calculate the inverse of the following $3 \times 3$ matrix $A$, which appeared in subsection \ref{subsec}:
\begin{eqnarray} \nonumber A = \left( \begin{array}{ccc} 0.3 & 0.1 & 0.1 \\ 0.1 & 0.3 & -0.1 \\ 0.1 & -0.1 & 0.3 \end{array} \right) \end{eqnarray}
\noindent We solve $A\mbox{\bf x} = \mbox{\bf b}$ for the three unit vectors $\mbox{\bf b}$ simultaneously, writing the right-hand sides as the columns of the identity matrix $I$, i.e.,
\begin{eqnarray} \nonumber \left( \begin{array}{ccc} 0.3 & 0.1 & 0.1 \\ 0.1 & 0.3 & -0.1 \\ 0.1 & -0.1 & 0.3 \end{array} \right) \left( \begin{array}{r} x \\ y \\ z \end{array} \right) = \left( \begin{array}{ccc} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{array} \right) \end{eqnarray}
The solution set remains identical if we multiply the {\it 2nd row} by 3 and subtract the {\it 1st row}, i.e.,
\begin{eqnarray} \nonumber \left( \begin{array}{ccc} 0.3 & 0.1 & 0.1 \\ 0 & 0.8 & -0.4 \\ 0.1 & -0.1 & 0.3 \end{array} \right) \left( \begin{array}{r} x \\ y \\ z \end{array} \right) = \left( \begin{array}{ccc} 1 & 0 & 0 \\ -1 & 3 & 0 \\ 0 & 0 & 1 \end{array} \right) \end{eqnarray}
In the same way, but this time we multiply the {\it 3rd row} by 3 and subtract the {\it 1st row}.
\begin{eqnarray} \nonumber \left( \begin{array}{ccc} 0.3 & 0.1 & 0.1 \\ 0 & 0.8 & -0.4 \\ 0 & -0.4 & 0.8 \end{array} \right) \left( \begin{array}{r} x \\ y \\ z \end{array} \right) = \left( \begin{array}{ccc} 1 & 0 & 0 \\ -1 & 3 & 0 \\ -1 & 0 & 3 \end{array} \right) \end{eqnarray}
Then multiply the {\it 1st row} by 8 and subtract the {\it 2nd row}:
\begin{eqnarray} \nonumber \left( \begin{array}{ccc} 2.4 & 0 & 1.2 \\ 0 & 0.8 & -0.4 \\ 0 & -0.4 & 0.8 \end{array} \right) \left( \begin{array}{r} x \\ y \\ z \end{array} \right) = \left( \begin{array}{ccc} 9 & -3 & 0 \\ -1 & 3 & 0 \\ -1 & 0 & 3 \end{array} \right) \end{eqnarray}
Multiply the {\it 3rd row} by 2 and add the {\it 2nd row}:
\begin{eqnarray} \nonumber \left( \begin{array}{ccc} 2.4 & 0 & 1.2 \\ 0 & 0.8 & -0.4 \\ 0 & 0 & 1.2 \end{array} \right) \left( \begin{array}{r} x \\ y \\ z \end{array} \right) = \left( \begin{array}{ccc} 9 & -3 & 0 \\ -1 & 3 & 0 \\ -3 & 3 & 6 \end{array} \right) \end{eqnarray}
Subtract the {\it 3rd row} from the {\it 1st row}:
\begin{eqnarray} \nonumber \left( \begin{array}{ccc} 2.4 & 0 & 0 \\ 0 & 0.8 & -0.4 \\ 0 & 0 & 1.2 \end{array} \right) \left( \begin{array}{r} x \\ y \\ z \end{array} \right) = \left( \begin{array}{ccc} 12 & -6 & -6 \\ -1 & 3 & 0 \\ -3 & 3 & 6 \end{array} \right) \end{eqnarray}
Multiply the {\it 2nd row} by 3 and add the {\it 3rd row}:
\begin{eqnarray} \nonumber \left( \begin{array}{ccc} 2.4 & 0 & 0 \\ 0 & 2.4 & 0 \\ 0 & 0 & 1.2 \end{array} \right) \left( \begin{array}{r} x \\ y \\ z \end{array} \right) = \left( \begin{array}{ccc} 12 & -6 & -6 \\ -6 & 12 & 6 \\ -3 & 3 & 6 \end{array} \right) \end{eqnarray}
Finally, divide the {\it 1st row} by 2.4, the {\it 2nd row} by 2.4, and the {\it 3rd row} by 1.2; we obtain
\begin{eqnarray} \nonumber \left( \begin{array}{ccc} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{array} \right) \left( \begin{array}{r} x \\ y \\ z \end{array} \right) = \frac{1}{2} \left( \begin{array}{ccc} 10 & -5 & -5 \\ -5 & 10 & 5 \\ -5 & 5 & 10 \end{array} \right) \end{eqnarray}
In other words, we have row-reduced the augmented pair $(A \mid I)$ to $(I \mid B)$: the columns of the right-hand-side matrix $B$ are the solutions of $A\mbox{\bf x} = \mbox{\bf b}$ for the three unit vectors $\mbox{\bf b}$, hence $B = A^{-1}$.\\
\noindent To make sure, calculate and find
\begin{eqnarray} \nonumber \left( \begin{array}{ccc} 0.3 & 0.1 & 0.1 \\ 0.1 & 0.3 & -0.1 \\ 0.1 & -0.1 & 0.3 \end{array} \right) \times \frac{1}{2} \left( \begin{array}{ccc} 10 & -5 & -5 \\ -5 & 10 & 5 \\ -5 & 5 & 10 \end{array} \right) = \left( \begin{array}{ccc} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{array} \right) \end{eqnarray}
\noindent Therefore
\begin{eqnarray} \nonumber A^{-1} = \frac{1}{2} \left( \begin{array}{ccc} 10 & -5 & -5 \\ -5 & 10 & 5 \\ -5 & 5 & 10 \end{array} \right) \end{eqnarray}
\end{document}