Research Note 3

Posted on 2017-06-23 | In Research Note , T-mesh


This week we finally get into the meat: splines. Now that the basic geometric functions work, we can move on to evaluating splines on rectangulations.

Interpolating Functions

As discussed in the last Research Note, a spline on a rectangulation is stored as one long coefficient vector, split into three chunks for vertices, edges and interior domain points respectively. But where do these coefficients come from? How do we interpolate functions on a rectangulation?

Similar to what happens on a triangulation of the 2D plane, we construct the coefficient vector by looping over all the unit rectangles, calculating the coefficients, and putting them in the right places in the long vector $c$.

We solve the interpolation problem the same way as in any other case, except that we use Bernstein polynomials as basis functions. First I use a function to evaluate all the Bernstein polynomials at $d+1$ equally spaced points on $[0,1]$ to get a matrix $B$, where $d$ is the degree of the polynomial in the $x$-direction. The same operation in the $y$-direction, with degree $\widetilde{d}$, gives another matrix $\widetilde{B}$. Then I evaluate the target function $f$ at the $(d+1)\times (\widetilde{d}+1)$ grid points to get a value matrix $V$. Finally I solve the linear system $$ B\times C \times\widetilde{B} = V $$ for the coefficient matrix $C$. Once I have the coefficients, I use the function getindex to find the indices of $c$ where they belong. All of the above is the first major function I worked on this week. It's named interdp, for interpolating domain points.
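To make the solve concrete, here is a minimal MATLAB sketch of the tensor-product Bernstein interpolation on a single unit square. It is my own illustration rather than the library code; the helper `bern` and all variable names are mine, and the transpose on the second basis matrix simply reflects how I lay it out.

```matlab
% Sketch only: interpolate f on the unit square by a tensor-product
% Bernstein polynomial of degree d in x and dt in y.
d  = 3;  dt = 4;                         % degrees in x and y
f  = @(x,y) x.^2 + 2*y.^3;               % the example function below

% Row of the d+1 Bernstein basis values B_{k,d}(t), k = 0..d, at a scalar t
bern = @(n,t) arrayfun(@(k) nchoosek(n,k), 0:n) .* t.^(0:n) .* (1-t).^(n-(0:n));

s  = linspace(0,1,d+1);                  % d+1 equally spaced points in x
st = linspace(0,1,dt+1);                 % dt+1 equally spaced points in y
B  = cell2mat(arrayfun(@(x) bern(d ,x), s' , 'UniformOutput', false));
Bt = cell2mat(arrayfun(@(y) bern(dt,y), st', 'UniformOutput', false));

[X,Y] = ndgrid(s, st);                   % the (d+1) x (dt+1) grid
V = f(X, Y);                             % value matrix V
C = B \ V / Bt';                         % solve B * C * Bt' = V

max(max(abs(B*C*Bt' - V)))               % sanity check, ~1e-16
```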

Note that, so far, this only handles functions on a rectangulation without hanging vertices. For refined rectangulations, we need to construct smoothness conditions and solve for the corresponding coefficients; that will be the major part of my work next week.

Example

A quick small example. Suppose we interpolate the function $f$ with $d=3$ and $\widetilde{d}=4$ on the unit square: $$ f = x^2 + 2y^3 $$ The resulting plot on a 3-by-3 rectangulation is

<img src = "/images/Research/Tmesh/interdp.jpg" width="500">

And the root mean square error of the resulting surface is $1.7\times 10^{-16}$.

Interpolation of Partial Derivatives

After tackling the problem of interpolating a general function on the rectangulation, I started working on the partial derivatives of a general function.

I started by experimenting on a tensor-product Bezier patch, i.e. a single rectangle with a Bernstein polynomial basis. Solving the interpolation problem for the partial derivative of a function amounts to taking the partial derivative of the Bernstein basis polynomials and adjusting the coefficients accordingly.

With basic linear algebra, one can easily see that differentiating a matrix of basis functions is the same as multiplying by a forward-difference matrix. Since the interpolation solution is given by a product of three matrices, we can apply the forward difference to any factor that contains the variable being differentiated. Professor Schumaker prefers to compute a lower-degree Bernstein basis matrix and apply the forward difference to the coefficient matrix $C$.
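As a concrete illustration of the forward-difference trick (my own sketch, continuing the toy code above rather than the library implementation): the $x$-partial of a degree-$(d,\widetilde{d})$ Bernstein patch with coefficient matrix $C$ has coefficients $d\,(c_{k+1,l}-c_{k,l})$, paired with degree-$(d-1)$ Bernstein polynomials in $x$.

```matlab
% Forward-difference matrix applied to the rows of C (sketch only).
D  = d * diff(eye(d+1));                 % d x (d+1) forward differences, scaled by d
Cx = D * C;                              % coefficients of df/dx, degree (d-1, dt)

% Evaluate df/dx at a point, reusing the bern() helper from the earlier sketch
x0 = 0.3;  y0 = 0.7;
dfdx = bern(d-1, x0) * Cx * bern(dt, y0)'   % -> 2*x0 = 0.6 for f = x^2 + 2y^3
```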

Eventually, we want a function that can return the values of the partial derivatives of any order of a function on a grid. That will be another task for next week.

Clustering

Posted on 2017-06-18 | In Study Note , Machine Learning


Clustering is an unsupervised learning technique for finding subgroups, or clusters, in a data set.

Compared with PCA:

  • PCA looks for a low-dimensional representation of the observations that explains a good fraction of the variance.
  • Clustering looks for homogeneous subgroups among the observations.

The two best-known clustering techniques are K-means clustering and hierarchical clustering. In K-means, we seek to partition the observations into a predefined number of clusters. In hierarchical clustering, we do not know in advance how many clusters there will be; we end up with a tree-like visual representation of the observations called a dendrogram.

K-Means Clustering

It partitions the observations into $K$ non-overlapping clusters, so $K$ must be specified first. Let $C_1,C_2,\dots,C_K$ be the sets containing the indices of the observations in each cluster; these sets are exhaustive and mutually exclusive. We want the within-cluster variation of each cluster, denoted $W(C_k)$, to be as small as possible.

Finding the K-means clustering amounts to solving the problem $$ \underset{C_1,\dots,C_K}{\text{minimize}} \sum_{k=1}^K W(C_k) $$ By far the most common way to define $W(C_k)$ uses squared Euclidean distance:

$$ W(C_k) = \frac{1}{|C_k|}\sum_{i,i'\in C_k}\sum_{j=1}^p(x_{ij}-x_{i'j})^2 $$ where $p$ is the dimension of the feature space.

Solving the above minimization problem is pretty hard. Fortunately, there is a simple algorithm that gives a pretty good solution:

— Algorithm:

  1. Randomly assign a number from 1 to $K$ to each of the observations. These are the initial cluster assignments.
  2. Iterate until the cluster assignments stop changing:
    1. For each of the $K$ clusters, compute the cluster centroid. The $k$th cluster centroid is the vector of the $p$ feature means for the observations in that cluster.
    2. Assign each observation to the cluster whose centroid is closest in Euclidean distance.

This algorithm is guaranteed to decrease the objective at each step, so it eventually reaches a local optimum. Since it does not guarantee a global optimum, it is important to run the algorithm multiple times with different random initial configurations.
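Here is a minimal MATLAB sketch of that algorithm (my own illustration; the toolbox function kmeans does this, and more, for you). It assumes no cluster ever becomes empty.

```matlab
function [idx, centroids] = simple_kmeans(X, K)
% Minimal K-means sketch (save as simple_kmeans.m). X is n x p, K = #clusters.
    n   = size(X, 1);
    idx = randi(K, n, 1);                        % step 1: random initial assignment
    changed = true;
    while changed
        % step 2(a): centroid of each cluster = mean of its observations
        centroids = zeros(K, size(X, 2));
        for k = 1:K
            centroids(k, :) = mean(X(idx == k, :), 1);   % assumes cluster k nonempty
        end
        % step 2(b): reassign each observation to the nearest centroid
        D2 = sum(X.^2, 2) + sum(centroids.^2, 2)' - 2 * X * centroids';
        [~, newidx] = min(D2, [], 2);
        changed = any(newidx ~= idx);
        idx = newidx;
    end
end
```

Because of the random initialization, the usual practice is to run it several times and keep the assignment with the smallest total within-cluster variation.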

In addition, selecting the number $K$ is also a difficult problem.

Hierarchical Clustering

The most common type of hierarchical clustering is bottom-up or agglomerative clustering: the dendrogram is built from the leaves, combining clusters up to the trunk.

To interpret a dendrogram, just remember that observations that are similar fuse near the bottom, while observations that are quite different fuse much higher up. The horizontal location of the leaves means nothing.

To identify clusters in a dendrogram, cut the tree horizontally and look at the resulting branches. Cutting at different heights may give different numbers of clusters, so a single dendrogram can be used to obtain any number of clusters.

— Algorithm

  1. Begin with the $n$ observations and a measure (e.g. Euclidean distance) of all the $n(n-1)/2$ pairwise dissimilarities. Treat each of the $n$ observations as its own cluster.
  2. Fuse the two most similar clusters, and repeat until only one cluster is left. The dissimilarity between the two clusters being fused indicates the height at which the fusion is drawn in the dendrogram.

For this algorithm to work, we need to extend the notion of dissimilarity from pairs of observations to groups of observations. This is done through the notion of linkage (see the sketch after the list below).

4 types of linkage:

  1. Complete: maximal intercluster dissimilarity
  2. Single: minimal intercluster dissimilarity
  3. Average: mean intercluster dissimilarity
  4. Centroid: dissimilarity between the centroids
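A quick sketch of how this looks with the Statistics and Machine Learning Toolbox (the data and the cluster count here are arbitrary toy choices of mine):

```matlab
rng(0);
X = [randn(20,2); randn(20,2) + 4];      % two loose groups of toy points

D = pdist(X);                            % all n(n-1)/2 pairwise Euclidean distances
Z = linkage(D, 'complete');              % agglomerative clustering, complete linkage
dendrogram(Z);                           % draw the tree

labels = cluster(Z, 'maxclust', 2);      % cut the dendrogram into 2 clusters
```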

Issues for Clustering

Small Decisions with Big Impacts

  • For hierarchical clustering,
    • What dissimilarity measure to use?
    • What type of linkage?
    • Where should we cut the dendrogram to obtain the clusters?
  • For K-means clustering
    • What value of $K$?

Validating the Clusters

We need to validate the clusters, to make sure we are not simply clustering the noise.

PCA

Posted on 2017-06-18 | In Study Note , Machine Learning

Principal Component Analysis

PCA refers to the process by which principal components are computed and the subsequent use of these components.

It's a technique to summarize a large set of correlated variables with a smaller number of representative variables.

Principal components

Principal components are directions in the $p$-dimensional feature space along which the observations are as interesting as possible, where interesting is measured by the amount that the observations vary along each direction. Each of the directions found by PCA is a linear combination of all $p$ features.

The first principal component of a set of features $X_1,X_2,\dots,X_p$ is the normalized linear combination of the features $$ Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + ... + \phi_{p1}X_p $$ that has the largest variance. By normalized, we mean that $\sum_{j=1}^p\phi_{j1}^2=1$. The elements $\phi_{11},\dots,\phi_{p1}$ are called the loadings of the first principal component, and together they form the loading vector.

Since we are only interested in variance, we can assume that each variable has been centered to have mean zero. Finding the loading vector is then essentially solving the optimization problem $$ \underset{\phi_{11},\dots,\phi_{p1}}{\text{maximize}}\; \frac{1}{n}\sum_{i=1}^n \Big(\sum_{j=1}^p\phi_{j1}x_{ij}\Big)^2 \quad \text{s.t.}\; \sum_{j=1}^p \phi_{j1}^2=1 $$ Since $z_{i1} = \phi_{11}x_{i1} + \phi_{21}x_{i2} + ... + \phi_{p1}x_{ip}$ and $\frac{1}{n}\sum_{i=1}^nx_{ij}=0$, the average of $z_{11},\dots,z_{n1}$ is zero, so what we are maximizing is just the sample variance of the $n$ values $z_{i1}$. We call $z_{11},\dots,z_{n1}$ the scores of the first principal component.

After the first principal component $Z_1$ is determined, we can find the second principal component $Z_{2}$: the linear combination of $X_1,\dots,X_p$ that has maximal variance among all linear combinations that are uncorrelated with $Z_1$. It turns out that constraining $Z_2$ to be uncorrelated with $Z_1$ is equivalent to constraining its loading vector to be orthogonal to that of $Z_1$. The remaining principal components are computed in the same way.
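A minimal sketch of this construction in base MATLAB (my own illustration; the toolbox function pca gives the same loadings and scores):

```matlab
rng(0);
X  = randn(100, 5) * randn(5);           % toy data with correlated features
Xc = X - mean(X, 1);                     % center each variable

[V, D] = eig(cov(Xc));                   % loadings = eigenvectors of the covariance
[~, order] = sort(diag(D), 'descend');
Phi = V(:, order);                       % loading vectors phi_1, ..., phi_p
Z   = Xc * Phi;                          % scores z_im

norm(Phi(:, 1))                          % = 1, the normalization constraint
C12 = corrcoef(Z(:,1), Z(:,2)); C12(1,2) % ~0, successive components are uncorrelated
```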

Once we have computed all the principal components, we can plot them against each other in order to produce low-dimensional views of the data.

More on PCA

Scaling the Variables

Before PCA is performed, the variables should be centered to have mean zero. Furthermore, the results also depend on whether the variables have been individually scaled.

In addition, for measures that have different units, the variables are usually scaled to have standard deviation one before PCA; if variables with very different standard deviations are left unscaled, the results can be misleading. However, in certain settings the variables may be measured in the same units, in which case we might not wish to scale them to have standard deviation 1.

Uniqueness

Each principal component loading vector is unique, up to a sign flip. So two different software packages should give the same principal components, though the signs may be opposite.

The Proportion of Variance Explained

We are interested in knowing the proportion of variance explained (PVE) by each principal component. The total variance in a data set (assuming mean zero for each variable) is defined as $$ \sum_{j=1}^p\textrm{Var}(X_j) = \sum_{j=1}^p\frac{1}{n}\sum_{i=1}^nx_{ij}^2 $$ and the variance explained by the $m$th principal component is $$ \frac{1}{n}\sum_{i=1}^nz_{im}^2 = \frac{1}{n}\sum_{i=1}^n\Big(\sum_{j=1}^p\phi_{jm}x_{ij}\Big)^2 $$ Therefore the PVE of the $m$th component is given by $$ \frac{\sum_{i=1}^n(\sum_{j=1}^p\phi_{jm}x_{ij})^2}{ \sum_{j=1}^p\sum_{i=1}^nx_{ij}^2} $$ In total, there are $\min(n-1, p)$ principal components, and their PVEs sum to one.
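Continuing the PCA sketch above, the PVEs take just a couple of lines (again my own illustration, reusing Xc and Z from that sketch):

```matlab
totalVar = sum(var(Xc, 1));              % total variance, 1/n normalization
pve = var(Z, 1) / totalVar;              % PVE of each principal component
cumsum(pve)                              % cumulative PVE, ends at 1
```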

Deciding How Many PC to Use

We typically are not interested in all of the principal components; rather, we would like to use just the first few of them. We usually decide by examining a scree plot: using the elbow method, we look for the point at which the proportion of variance explained by each subsequent principal component drops off.

This type of visual analysis is inherently ad hoc, and unfortunately there is no well-accepted objective way to decide how many principal components are enough.

SVM

Posted on 2017-06-18 | In Study Note , Machine Learning

Support Vector Machine

These notes follow Chapter 9 of Statistical Learning, starting from the maximal margin classifier, moving to the support vector classifier, and finally arriving at the SVM.

Maximal Margin Classifier

In a $p$-dimensional space, a hyperplane is a flat affine subspace of dimension $p-1$. For example, in two dimensions a hyperplane is a line.

One intuitive approach is to use a hyperplane to classify data points. But whenever one separating hyperplane exists, there are infinitely many of them, so we pick the hyperplane with the greatest minimal perpendicular distance to the data points, called the maximal margin classifier.

Note:

  1. The maximal margin hyperplane depends only on a few closest data points, which are called support vectors.
  2. It may change a lot due to a single data point.

Construction

The maximal margin hyperplane is the solution to the optimization problem $$ \underset{\beta_0,\dots,\beta_p}{\text{maximize}}\; M \\ \text{subject to}\; \sum_{j=1}^p \beta_j^2 = 1, \\ y_i(\beta_0+\beta_1x_{i1}+...+\beta_px_{ip}) \geq M \;\; \forall\; i= 1,...,n $$ The normalization constraint ensures that the perpendicular distance from the $i$th data point to the hyperplane is given by $$ y_i(\beta_0+\beta_1x_{i1}+...+\beta_px_{ip}) $$ So together, the constraints ensure that each observation is on the correct side of the hyperplane and at least a distance $M$ from it.

However, the data points may not be separable by any hyperplane. In that case we generalize to the support vector classifier.

Support Vector Classifier

Consider a classifier based on a hyperplane that does not perfectly separate the two classes, in the interest of

  • Greater robustness to individual observations
  • Better classification of most of the training observations.

Construction

In this case the optimization problem becomes $$ \underset{\beta_0,\dots,\beta_p,\,\epsilon_1,\dots,\epsilon_n}{\text{maximize}}\; M \\ \text{subject to}\; \sum_{j=1}^p \beta_j^2 = 1, \\ y_i(\beta_0+\beta_1x_{i1}+...+\beta_px_{ip}) \geq M(1-\epsilon_i) \;\; \forall\; i= 1,...,n, \\ \epsilon_i \geq 0, \;\; \sum_{i=1}^n\epsilon_i \leq C $$ where $C$ is a nonnegative tuning parameter that can be thought of as a budget for the amount by which the margin may be violated by the $n$ observations. When $C$ is 0, the problem reduces to the maximal margin hyperplane problem. In practice, $C$ is generally chosen by cross-validation; it controls the bias-variance trade-off.

Similar to the MMC, the SVC has a nice property: only observations that lie on the margin or that violate the margin affect the hyperplane. So if an observation lies strictly on the correct side of the margin, it does not affect the classifier. Observations that lie on the margin or on the wrong side of the margin for their class are known as support vectors.

Support Vector Machine

One way to convert a linear classifier into a non-linear one is to enlarge the feature space using quadratic, cubic or higher-order polynomial functions of the predictors. The resulting decision boundary is linear in the enlarged feature space, but not in the original one. However, we may end up with a huge number of features and a big increase in computation.

SVM is a way to enlarge the feature space but leads to efficient computation. It uses kernels.

The solution to the support vector classifier problem involves only the inner products of the observations, not the observations themselves. It can be shown that the linear support vector classifier can be represented as $$ f(x) = \beta_0 + \sum_{i=1}^n \alpha_i\langle x,x_i\rangle $$ where there are $n$ parameters $\alpha_i$, one per training observation.

It turns out that only the $\alpha_i$ of the support vectors are nonzero, so if $S$ is the collection of indices of the support vectors, the solution is $$ f(x) = \beta_0 + \sum_{i\in S}\alpha_i \langle x,x_i\rangle $$ Now, for every appearance of an inner product we can substitute a generalization of it, $$ K(x_i,x_{i'}) $$ where $K$ is some function that we call a kernel. It quantifies the similarity of two observations. Choosing it to be the inner product gives back the SVC; this is known as a linear kernel. Another choice is $$ K(x_i,x_{i'})= (1+\sum_{j=1}^p x_{ij}x_{i'j})^d $$ This is known as a polynomial kernel of degree $d$, where $d$ is a positive integer. It gives a much more flexible decision boundary; it essentially amounts to fitting a support vector classifier in a higher-dimensional space involving polynomials of degree $d$.
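For concreteness, here is what the polynomial kernel looks like in MATLAB. This is a sketch of mine; fitcsvm in the Statistics and Machine Learning Toolbox handles kernels for you, and alpha, beta0 and the support set S below are placeholders that an SVM solver would supply.

```matlab
polyKernel = @(X1, X2, d) (1 + X1 * X2').^d;   % K(x, x') = (1 + <x, x'>)^d

X = randn(20, 3);                              % 20 training observations, p = 3
K = polyKernel(X, X, 2);                       % 20 x 20 Gram matrix of K(x_i, x_i')

% Evaluating f(x) = beta0 + sum_{i in S} alpha_i * K(x, x_i) at a new point:
x_new = randn(1, 3);
k_new = polyKernel(x_new, X, 2);               % 1 x 20 row of kernel values
% f = beta0 + k_new(S) * alpha(S);             % alpha, beta0, S come from the solver
```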

When the support vector classifier is combined with a non-linear kernel, the resulting classifier is known as a support vector machine.

The advantage of using a kernel is computational: we only need to evaluate the kernel function for all pairs of observations.

Extension

The SVM can be extended to the multi-class case, using either One-vs-One or One-vs-All.

The SVM has a loss function similar to that of logistic regression, which often performs better, so the SVM is not as widely used nowadays. The kernel trick, though it can also be applied to other classifiers, is used most often in the context of SVMs.

Research Note 2

Posted on 2017-06-17 | In Research Note , T-mesh


This week I finished the implementation of the refinerec function and moved on to getcoef and findrec. Refinerec is a complicated function, and so far I have not found an easy way to reduce the amount of code.

Storing Spline on Rectangulation

Getcoef stands for "get coefficients". Tensor-product splines on a rectangulation are handled much like splines on a triangulation: we store them by storing the coefficients of the Bernstein polynomials at the domain points of each rectangle. For a tensor product we need a degree in both the $x$ and $y$ directions, so in total there are $(d+1) \times (\widetilde{d}+1)$ domain points for each unit rectangle. However, different rectangles may share some domain points.

Our storage scheme separates the long coefficient vector $c$ into three chunks. The first chunk has length $n_v$, the number of vertices in the rectangulation, and stores the coefficient at each vertex in vertex-number order.

The second chunk is for edges. Since there are vertical edges and horizontal edges, and the numbers of domain points on the two types of edges differ, this chunk is a long vector of size $(d-1)\times n_{he} + (\widetilde{d}-1)\times n_{ve}$. At first we tried to separate the two types and store all the horizontal ones first, but later we decided to simply stack them in edge-number order.

The last chunk is for the interior domain points of each rectangle, so its size is $(d-1)\times (\widetilde{d}-1)\times n_r$. We also store them in rectangle-number order.

Now that we have one long coefficient vector, grabbing the coefficients for a given rectangle is no longer trivial, and that's why the function getcoef is needed.

It accepts the rectangulation information and a number $k$, and returns the coefficients for the domain points of rectangle $k$.
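To illustrate the bookkeeping, here is a toy sketch of the indexing arithmetic implied by the layout above. It is purely my own illustration, not the actual getcoef code, and all the sizes below are made-up numbers.

```matlab
d = 3;  dt = 4;                 % degrees in x and y
n_v = 12;                       % number of vertices (toy value)
isHorizontal = logical([1 0 1 0 1 0 1 1 0 1]);   % one orientation flag per edge (toy)
k = 2;                          % rectangle whose interior coefficients we want

edgeLen  = (d-1)*isHorizontal + (dt-1)*(~isHorizontal);   % block size of each edge
intStart = n_v + sum(edgeLen);                            % start of the interior chunk
idx = intStart + (k-1)*(d-1)*(dt-1) + (1:(d-1)*(dt-1))    % indices in c for rectangle k
```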

Findrec

Findrec is another function: given a point inside the rectangulation, it finds the number of the rectangle that contains the point.

At first I thought I had a clever way to do it: compute the vectors from the target point to the two diagonal corners of each rectangle and check whether they point in opposite directions. This method has time complexity $O(n_r)$ and does not require building a $k$-d tree. However, if there are many points to locate, the cost scales with the number of query points. Professor Schumaker suggested first partitioning each unit rectangle into two triangles and using the built-in tsearchn function; that way the code is faster when there are many points.
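The $O(n_r)$ test is essentially a componentwise sign check. Here is a small sketch of it with made-up rectangles (my own illustration, not the findrec code):

```matlab
% Each rectangle is stored here as its lower-left (LL) and upper-right (UR) corner.
LL = [0 0; 1 0; 0 1; 1 1];               % 4 toy unit rectangles tiling [0,2]^2
UR = [1 1; 2 1; 1 2; 2 2];
p  = [1.5, 0.25];                        % query point

% p lies in rectangle r iff the vectors p - LL(r,:) and p - UR(r,:) have
% opposite signs in both coordinates:
inside = all((p - LL) .* (p - UR) <= 0, 2);
r = find(inside, 1)                      % -> 2

% For many query points, splitting each rectangle into two triangles and
% calling the built-in tsearchn once is faster, as suggested above.
```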

The example output of findrec is $$ \begin{bmatrix} 4 & 6 & 6 & 1 & 3 \end{bmatrix}^{\textrm{T}} $$ for the following diagram.

<img src="/images/Research/Tmesh/findrec.jpg" width="500">

MapReduce

Posted on 2017-06-13 | In Study Note , Bigdata , Hadoop


MapReduce is a programming model and associated implementation for processing and generating large data sets. As the name suggests, it consists of two parts: map and reduce.

  • It uses a map function to generate a set of intermediate key/value pairs.
  • It uses a reduce function to merge all intermediate values associated with the same intermediate key.

There are three key ideas behind MapReduce.

Parallelism

Map and reduce are parallel in nature: each map task can handle its own chunk of data, and so can each reduce task. However, the reduce phase must wait until all maps finish, so a barrier is needed between the two phases.

In addition, in the intermediate phase between map and reduce, the intermediate results are usually shuffled (grouped by key) so that they can be reduced more efficiently.

Abstraction

The MapReduce paradigm turns everything into two simple functions, map and reduce, so developers don't have to worry about the other details.

<img src="/images/MapReduce/mapreduce2.gif">

MapReduce provides the map and reduce abstract programming interfaces. The map interface takes a key/value pair holding the data as input, processes it, and emits key/value pairs as output.

```
map: (k1, v1) -> [(k2, v2)]
```

Reduce takes all those pairs and combines the values that share the same key, finally producing output of the same key/value form.

```
reduce: (k2, [v2]) -> [(k3, v3)]
```
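A toy word count makes the two signatures concrete. This is a plain single-machine MATLAB sketch of mine, not Hadoop code; the map step emits (word, 1) pairs and the reduce step sums the values per key.

```matlab
docs = {'the quick brown fox', 'the lazy dog', 'the fox'};

% map: (docId, text) -> [(word, 1)]
pairs = {};
for i = 1:numel(docs)
    for w = strsplit(docs{i})
        pairs(end+1, :) = {w{1}, 1};             %#ok<AGROW>
    end
end

% shuffle + reduce: (word, [1, 1, ...]) -> (word, count)
counts = containers.Map('KeyType', 'char', 'ValueType', 'double');
for i = 1:size(pairs, 1)
    key = pairs{i, 1};
    if isKey(counts, key)
        counts(key) = counts(key) + pairs{i, 2};
    else
        counts(key) = pairs{i, 2};
    end
end
keys(counts), values(counts)                     % e.g. 'the' -> 3, 'fox' -> 2
```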

Uniform Architecture

In earlier parallel computing frameworks, developers had to handle storage, partitioning, distribution, collection and fault tolerance themselves. With MapReduce they are freed from those troubles, because it provides a uniform architecture that does all the dirty work.

How MapReduce Works

MapReduce consists of 4 main components.

  • Client: the one that submits the MapReduce program. Every job on the client is packed together with its configuration files into a jar and stored on HDFS by the JobClient.
  • JobTracker: coordinates the jobs. It's a Java program that communicates with clients, accepts commands from them, monitors the heartbeats of the TaskTrackers, and schedules jobs according to a scheduling algorithm.
  • TaskTracker: runs the tasks. It's also a Java program, and it sends a heartbeat to the JobTracker periodically.
  • File System: generally HDFS is preferred.

One can view the JobTracker as the master and the TaskTracker as the slave in the traditional master-slave control paradigm.

Research Note 1

Posted on 2017-06-10 | In Research Note , T-mesh


On Friday morning, I met with Professor Schumaker to further discuss the code I wrote. Up to this point, I have finished the initrec and reclist functions, which together create a long list containing the information needed for the rectangle partition, or, as I personally call it, the rectangulation.

We decided to combine these two functions into one and separate the labeling and plotting into a new function called plotr. With help from Shiying, I improved the code by vectorizing all the loops.

Now we finally move on to the refinerec function. It takes all the components of our reclist output and an integer $k$ indicating which rectangle to refine. The refinement itself is pretty simple: we cut the rectangle in half into two equal smaller rectangles, always cutting across the longer side to avoid long, thin rectangles. We also need to update the whole list so that each component keeps its meaning.
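The geometric core of the refinement is tiny; the hard part is the bookkeeping. Here is a sketch of just the split rule (my own illustration, storing a rectangle as [xmin xmax ymin ymax], which is not necessarily the library's representation):

```matlab
R = [0 3 0 1];                                   % a 3-by-1 rectangle
if (R(2) - R(1)) >= (R(4) - R(3))                % longer in x: cut with a vertical line
    xm = (R(1) + R(2)) / 2;
    R1 = [R(1) xm R(3) R(4)];   R2 = [xm R(2) R(3) R(4)];
else                                             % longer in y: cut with a horizontal line
    ym = (R(3) + R(4)) / 2;
    R1 = [R(1) R(2) R(3) ym];   R2 = [R(1) R(2) ym R(4)];
end
[R1; R2]                                         % -> [0 1.5 0 1; 1.5 3 0 1]
```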

The idea sounds pretty simple, but the actual code is far trickier than I expected. First, in our convention we no longer keep the original rectangle in the long list; in fact, we do not keep any superrectangles, i.e. rectangles that contain subrectangles. We achieve this by updating the original rectangle in place and adding one new rectangle to the list. For vertices, refining one rectangle may add a hanging vertex: a vertex that lies on an edge but is not an intersection of edges. For example, vertex 8 in the following rectangulation is a hanging vertex.

<img src="/images/Research/Tmesh/hangver.jpg"></img>

However, refinement may also change a hanging vertex into a regular interior vertex. For instance, if we refine R2 in the above rectangulation, V8 will no longer be a hanging vertex.

For edges, the refinement requires adding new short edges and checking whether every stored edge is still a side of at least one rectangle. In our convention, we only store edges that are sides of rectangles without subrectangles. For example, in the figure above, the edge connecting V3 and V4 is kept after the refinement; but if we then refine R2, that edge should be removed.

One refinement example is demonstrated here. We start with 2 rectangles.

<img src="/images/Research/Tmesh/recref1.jpg" width="400"></img>

Then we refine the first rectangle three times. The results are as follows:

<img src="/images/Research/Tmesh/recref2.jpg" width="400"></img>

<img src="/images/Research/Tmesh/recref3.jpg" width="400"></img>

<img src="/images/Research/Tmesh/recref4.jpg" width="400"></img>

The list cannot be shown due to confidentiality. So far the code is 500+ lines, and I have not yet finished.

There are hundreds of other small details I have to watch while implementing the code, like keeping track of the left and right rectangles of each edge, the subset relations between edges, and the ordering of edges. So far, in this first week, we are still in the geometry phase, not touching any splines yet. After the refinerec function is complete, we will start to store splines on domain points and get into the real meat.


Research Start

Posted on 2017-06-06 | In Research Note , T-mesh


Today is the first day of my 2017 summer research. Under Professor Larry Schumaker's guidance, I will work on several projects over the following two and a half months. Basically all of them belong to numerical analysis; in particular, I am building libraries for special spline computations. All the code will be written in Matlab.

The first project I will work on is a rectangular tensor-product spline library with hanging vertices, also referred to as T-meshes in the spline literature. I will first implement functions to create 2D rectangle partitions and generate all the needed information, such as the edges of each rectangle or whether a given edge is a boundary edge. These first few functions are purely geometric Matlab problems. But as we move on, we will do exactly what we do on triangulations of the 2D plane: I will create the domain points and determining sets for any partition, store all kinds of splines, compute polynomials, and add smoothness conditions at edges and vertices. Finally, when we are comfortable with the 2D case, we will move on to 3D space. Note that all the code should support hanging vertices.

I have already gotten the assignment for the first week. I need to implement code that generates a nice-looking rectangle partition for any given numbers of vertices on the x-axis and y-axis, like the following, for which (3, 4) was passed as the arguments.

<img src="/images/Research/Tmesh/rec34.jpg" width='500'>

Then I also need to write a fairly involved piece of code to get all the information needed about the partition. Since Professor Schumaker may use my code for his monograph and for commercial purposes, I cannot disclose the details of the code. The last task of the first week is to write code that refines a given rectangle in the partition and updates the list. The refinement can be defined in any manner, and it may lead to hanging vertices.

1.2 Probability Theory

Posted on 2017-06-01 | In Study Note , PRML

Probability Theory

Recently I have been reading Pattern Recognition and Machine Learning, aka PRML. Here I record some of the key concepts and ideas from Chapter 1.2, Probability Theory.

In this note, I skip the part where the book introduces the basic idea of probability, the two fundamental rules (the sum rule and the product rule), and the basic idea of Bayes' theorem. The only thing I need to point out is the pair of concepts prior probability and posterior probability: the former is the probability of some event before any evidence is gathered, the latter the probability after some evidence has been obtained.

Expectations and Covariances

The expectation, as we all learned in high school, is defined as

$$ E[f] = \sum\limits_{x}p(x)\,f(x) $$ for a discrete distribution, or $$ E[f] = \int p(x)\,f(x)\,dx $$ for a continuous variable. It's the average value of some function $f(x)$ under the probability distribution $p(x)$. In either case, we can approximate it using a finite number $N$ of points drawn from the distribution: $$ E[f] \approx \frac{1}{N} \sum_{n=1}^{N} f(x_n) $$ The approximation becomes exact as $N$ goes to infinity.
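A quick numerical check of the finite-sample approximation (my own toy example): estimate $E[f]$ for $f(x)=x^2$ under a standard normal, whose true value is 1.

```matlab
N = 1e6;
x = randn(N, 1);                 % N samples from N(0, 1)
Ef_hat = mean(x.^2)              % approaches E[x^2] = 1 as N grows
```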

A conditional expectation with respect to a conditional distribution is defined as $$ E_x[f|y] = \sum_x p(x|y)\,f(x) $$ The variance of $f(x)$ is defined by $$ \begin{split} var[f] &= E[(f(x) - E[f(x)])^2] \\ &= E[f(x)^2] - E[f(x)]^2 \end{split} $$ For two random variables $x$ and $y$, the covariance is defined as $$ cov[x,y] = E_{x,y}\big[\{x-E[x]\}\{y-E[y]\}\big] $$ In the case of two vectors of random variables $\textbf{x}$ and $\textbf{y}$, the covariance is a matrix $$ \begin{split} cov[\textbf{x},\textbf{y}] &= E_{\textbf{x},\textbf{y}}\big[\{\textbf{x}-E[\textbf{x}]\}\{\textbf{y}^{\mathrm{T}}-E[\textbf{y}^{\mathrm{T}}]\}\big] \\ &= E_{\textbf{x},\textbf{y}}[\textbf{x}\textbf{y}^{\mathrm{T}}]-E[\textbf{x}]E[\textbf{y}^{\mathrm{T}}] \end{split} $$

Baysian Probability

In early education, it's normal to treat probabilities in terms of the frequencies of random, repeatable events. This is called the classical or frequentist interpretation of probability. For some uncertain events, like whether the moon will crash into the Earth, it's impossible to measure a frequency. However, we can use the Bayesian interpretation of probability to represent uncertainty. It has been shown that, under rather general sets of properties and axioms, such numerical measures of uncertainty behave precisely according to the rules of probability, so it's natural to refer to them as (Bayesian) probabilities.

With observed evidence, Bayes' theorem allows us to convert a prior probability into a posterior probability. To quantify the uncertainty in the model parameters $\textbf{w}$, or even in the model itself, we can use the machinery of probability theory.

Before we observe the data, we capture our assumptions about $\textbf{w}$ in the prior probability $p(\textbf{w})$. After we have observed $\mathit{D} = \{t_1,\dots,t_N\}$, Bayes' theorem, which takes the form $$ p(\textbf{w}|\mathit{D}) = \frac{p(\mathit{D}|\textbf{w})p(\textbf{w})}{p(\mathit{D})} $$ allows us to evaluate the uncertainty in $\textbf{w}$ in the form of the posterior probability $p(\textbf{w}|\mathit{D})$.

The quantity $p(\mathit{D}|\textbf{w})$, viewed as a function of the parameter vector $\textbf{w}$, is called the likelihood function. It expresses how probable the observed data set is for different settings of $\textbf{w}$. With this definition, we can state Bayes' theorem in another form $$ posterior \propto likelihood \times prior $$ where all of these quantities are viewed as functions of $\textbf{w}$.

In both the Bayesian and frequentist paradigms, the likelihood function $p(\mathit{D}|\textbf{w})$ plays a central role, but the manner in which it is used is different. In a frequentist setting, $\textbf{w}$ is considered a fixed parameter whose value is determined by some 'estimator', and error bars on this estimate are obtained from the distribution of possible data sets $\mathit{D}$.

One widely used frequentist estimator is maximum likelihood, in which $\textbf{w}$ is chosen to maximize the likelihood function. In the machine learning literature, the negative log of the likelihood function is called an error function. Since the negative log is monotonically decreasing, maximizing the likelihood is equivalent to minimizing the error.

The Gaussian distribution

The famous normal or Gaussian distribution for a single real-valued variable $x$ is defined by $$ N(x|\mu,\sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}}e^{-\frac{1}{2\sigma^2}(x-\mu)^2} $$ which is governed by two parameters, the mean $\mu$ and the variance $\sigma^2$. The reciprocal of the variance, written $\beta=1/\sigma^2$, is called the precision. The Gaussian distribution is clearly nonnegative, and it is straightforward to show that it is normalized, so it satisfies the two requirements of a valid probability density. One can also easily find the expectation and variance of $x$, which are $\mu$ and $\sigma^2$ respectively.

For a $D$-dimensional vector $\textbf{x}$ of continuous variables, the Gaussian distribution is given by $$ N(\pmb{x}|\pmb{\mu},\pmb{\Sigma}) = \frac{1}{(2\pi)^{D/2}|\pmb{\Sigma}|^{1/2}}e^{-\frac{1}{2}(\pmb{x}-\pmb{\mu})^T\pmb{\Sigma}^{-1}(\pmb{x}-\pmb{\mu})} $$ Suppose we have a data set of $N$ observations $\mathbf{x} = (x_1,\dots,x_N)^T$ drawn independently from a Gaussian distribution with mean $\mu$ and variance $\sigma^2$. We call such a data set independent and identically distributed, or i.i.d. We can write the probability of this data set in the form $$ p(\mathbf{x}|\mu,\sigma^2) = \prod_{n=1}^{N} N(x_n|\mu,\sigma^2) $$ Now we can determine the unknown parameters $\mu$ and $\sigma^2$ by maximizing this likelihood function. In practice, it's more convenient to maximize the log of the function, which is $$ \ln p(\mathbf{x}|\mu,\sigma^2)=-\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n -\mu)^2-\frac{N}{2}\ln \sigma^2 - \frac{N}{2}\ln (2\pi) $$ Maximizing with respect to $\mu$, we obtain the maximum likelihood solution $$ \mu_{ML} = \frac{1}{N}\sum_{n=1}^Nx_n $$ which is the sample mean. Similarly, maximizing with respect to $\sigma^2$, we obtain the maximum likelihood solution for the variance, $$ \sigma_{ML}^2 = \frac{1}{N}\sum_{n=1}^{N}(x_n-\mu_{ML})^2 $$ which is the sample variance measured with respect to the sample mean.

In fact, the maximum likelihood approach systematically underestimates the variance of the distribution. This is an example of a phenomenon called bias and is related to the problem of over-fitting. First note that the maximum likelihood solutions $\mu_{ML}$ and $\sigma^2_{ML}$ are functions of the data set values $x_1,\dots,x_N$. Their expectations with respect to the data can be shown to be $$ \begin{split} E[\mu_{ML}] &= \frac{1}{N}E\Big[\sum_{n=1}^N x_n\Big] =\mu \\ E[\sigma^2_{ML}] &= \frac{1}{N} E\Big[\sum_{n=1}^{N} (x_n-\mu_{ML})^2\Big] \\ &= \frac{1}{N} E\Big[\sum_{n=1}^N(x_n-\mu)^2 - N(\mu_{ML}-\mu)^2\Big] \\ &= \frac{1}{N} \Big( N\sigma^2 - N\,\frac{\sigma^2}{N} \Big) \\ &= \frac{N-1}{N} \sigma^2 \end{split} $$ So on average, the maximum likelihood estimate gives the correct mean but underestimates the true variance by a factor of $(N-1)/N$. Therefore, we use $$ \tilde{\sigma}^2 = \frac{N}{N-1} \sigma^2_{ML} = \frac{1}{N-1}\sum_{n=1}^{N}(x_n-\mu_{ML})^2 $$ to obtain an unbiased estimate of the variance. Note that the bias of the maximum likelihood solution becomes less significant as $N$ goes to infinity.
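The bias result is easy to see numerically. The following sketch (toy values of mine) averages $\sigma^2_{ML}$ over many data sets of size $N$ and compares it with $\frac{N-1}{N}\sigma^2$.

```matlab
rng(1);
N = 5;  sigma2 = 2;  trials = 1e5;
s2_ml = zeros(trials, 1);
for t = 1:trials
    x = sqrt(sigma2) * randn(N, 1);      % N i.i.d. samples from N(0, sigma2)
    mu_ml    = mean(x);                  % maximum likelihood mean
    s2_ml(t) = mean((x - mu_ml).^2);     % maximum likelihood variance
end
mean(s2_ml)                              % ~1.6
sigma2 * (N - 1) / N                     % = 1.6
```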

PageRank

Posted on 2017-05-26 | In Study Note , Bigdata , Others

PageRank: Bring Order to Web


The PageRank algorithm was the fundamental algorithm behind early search engines. In my Big Data course, I gave a 30-minute presentation on it; my slides can be found <a href="../others/PageRank.pptx" style="color:blue">here</a>. In this note, I give a brief introduction to the famous PageRank algorithm and then extend to TF-IDF, which together give a relatively thorough way to rank Web pages.

PageRank

First, let's dive into the golden building block of early Google: PageRank.

Motivation

As of 1999, the graph of the WWW (World Wide Web) had about 150 million nodes (pages) and around 1.7 billion edges (links). Every Web page contains links that point to other pages, and other pages contain links that point to it. To keep things simple, we call the links in a page that point to other pages its forward links, and the links in other pages that point to this page its backward links. Obviously, given a Web page we can easily find all of its forward links, but it's hard to know all of its backward links in the entire WWW structure.

In addition, we need a few more simple definitions to explain PageRank. Let $u$ be a web page and define the following terms:

  1. $F_u$ is the forward set of $u$, which consists of all the pages that $u$ points to.
  2. $B_u$ is the backward set of $u$, which consists of all the pages that point to $u$.
  3. $N_u$ is the cardinality of $F_u$, which indicates the number of pages that $u$ points to.

<img src="/images/PageRank/ABC.png" width="150" align=center>

For example, in the system above there are only 3 pages, and the arrows indicate links connecting the pages: $F_A = \{B,C\}$, $B_A =\{C\}$ and $N_A= 2$.

Simple PageRank

Suppose we have a web page $u$, its PageRank $R(u)$ can be computed by

$$ R(u) = c \sum\limits_{v\in B_u}\frac{R(v)}{N_v} $$

Before I explain the formula, I need to point out that this is the first version of PageRank in the paper; to make that clear, I added the word simple.

The idea here is pretty straightforward: the rank of a page depends on the contributions from its backward pages. Notice that this is a recursive definition.

Example

<img src="/images/PageRank/example.png" width="150" align = center style="display:inline">

Let me walk through an example to better explain the equation. In the system above, we have 4 pages. Since the PageRank iteration converges, we can start from almost any initial values (anything with a nonzero component along the principal eigenvector of the transition matrix); here we use a uniform distribution. In addition, we set $c$ in the formula to 1.

According to our definition, in the first iteration of the algorithm, $$ R(A) = \frac{R(B)}{2} + \frac{R(C)}{1} + \frac{R(D)}{3} = 0.458 $$ The denominators in the fractions are $N_B$, $N_C$ and $N_D$, which we can easily obtain by counting the arrows. Similarly, we can calculate

$$ \begin{split} R(B) &= \frac{R(D)}{3} = 0.083 \\
R(C) &= \frac{R(B)}{2} + \frac{R(D)}{3} = 0.208 \\
R(D) &= \frac{R(A)}{1} = 0.25 \end{split} $$

Note that within one iteration we always use the ranks from the previous iteration, not the newly calculated ones.

Another way to look at the example is to define a transition matrix $T$, where $T_{ij}$ denotes the portion of rank flowing from page $j$ to page $i$: $$ T = \begin{pmatrix} 0 & .5 & 1 & .33 \\ 0 & 0 & 0 & .33 \\ 0 & .5 & 0 & .33 \\ 1 & 0 & 0 & 0 \end{pmatrix} $$

Then the new rank vector can be calculated by $$ R=cTR $$
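The hand calculation above can be checked with a couple of lines of MATLAB (my own sketch; $T$ is the transition matrix above, with $c = 1$ and a uniform start):

```matlab
T = [0  .5  1  1/3;
     0   0  0  1/3;
     0  .5  0  1/3;
     1   0  0  0  ];
R = [0.25; 0.25; 0.25; 0.25];            % uniform initial ranks, c = 1
R = T * R                                % -> [0.458; 0.083; 0.208; 0.250]
```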

Problem: Rank Sink

There are reasons why we call this version simple PageRank and do not adopt it as-is. One problem is the rank sink.

<img src="/images/PageRank/sink.png" width="250">

Consider the 3 pages above inside a web system, where they form a loop. As we iterate, the rank of these pages never decreases; the loop keeps absorbing rank from the rest of the system. We call this problem a rank sink.

Formal PageRank

To overcome rank sinks, the people at Google introduced a vector $E(u)$ called the rank source. In the paper, they formally define the PageRank $R'$ as $$ R'(u) = c \sum\limits_{v\in B_u}\frac{R'(v)}{N_v} + cE(u) $$ where $c$ is maximized subject to the total rank summing to 1.

The rank source $E$ is a vector over the whole Web that makes up for rank sinks. Since nowadays people use a damping factor instead, I will not discuss it in detail here; how it works is explained in Larry Page's paper [1].

One thing to notice: one can customize the algorithm by choosing a different $E$. For example, choosing $E$ to be uniform over all web pages corresponds to a random surfer who may jump to a random page at each step.

Damping Factor

However, this formal PageRank is still not the version people use now. Later versions have a cleverer way to deal with problems like rank sinks: the damping factor. The damping factor $d$ is just a number between 0 and 1. The new recipe is $$ R(u) = \frac{1-d}{N} + d \sum\limits_{v\in B_u} \frac{R(v)}{N_v} $$ The damping factor $d$ takes over the role of $c$ in adjusting the final sum, and $N$ here is the total number of pages in the whole system.

Think of the PageRank algorithm as a model of a random surfer. At each step, with probability $d$ the surfer follows a random link from the current page; with probability $1-d$ the surfer gets bored and jumps to a random page in the system.

Therefore, the first term in the equation represents the probability that the surfer jumps directly to the current page, and the second term corresponds to the probability that the surfer reaches the page by following a link. Together they give the total probability that the surfer is at a particular page at any time; thus the ranks of all the pages form a probability distribution.

There is a nice demonstration in my <a href="../others/PageRank.pptx" style="color:blue">slide</a>. Feel free to download and go over it.

In practice, the damping factor is usually chosen to be 0.85, which is a good balance between a high rate of convergence and numerical stability.
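Putting the recipe together as a power iteration (my own sketch, reusing the 4-page transition matrix from the earlier example):

```matlab
d = 0.85;                                % damping factor
N = size(T, 1);                          % total number of pages
R = ones(N, 1) / N;                      % start from the uniform distribution
for it = 1:1000
    Rnew = (1 - d) / N + d * (T * R);
    if norm(Rnew - R, 1) < 1e-12, R = Rnew; break; end
    R = Rnew;
end
R, sum(R)                                % stationary ranks; they still sum to 1
```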

TF-IDF

Another popular method to rank documents is Term Frequency - Inverse Document Frequency or TF-IDF. TF-IDF is the product of two numbers, term frequency and inverse document frequency.

Term frequency, as its name suggests, is the count of a given word $t$ in a document $f$: $tf(t,f)=$ # of occurrences of $t$ in file $f$.

Inverse document frequency depends on the proportion of documents that contain the target word among all the documents. For a given word $t$ and a collection of documents $F$: $idf(t,F) = \log($ # of files in $F$ / # of files in $F$ that contain $t$ $)$.

Then $tf\text{-}idf(t,f,F) = tf(t,f) \times idf(t,F)$.

Example

For example, given target word "Vanderbilt" and a bunch of files $f_1$, $f_2$, $f_3$ … $f_N$, the tf and idf of file $f_i$ are

tf("Vanderbilt", $f_i$) = # "Vanderbilt" in $f_i$

idf("Vanderbilt", {$f_i$}) = $log(\frac{N}{number;of;files;contain;"Vanderbilt"})$

And the $tf\text{-}idf$ of $f_i$ is the product of the above terms.

Comparison

An astute reader may have noticed the difference between these two methods: PageRank focuses only on the link structure of the web system and ignores whatever content a page may contain; conversely, $tf\text{-}idf$ depends entirely on the content and neglects the linkage.

Therefore, people have tried to combine the two methods with a weight parameter: $$ Rank = w\cdot PageRank + tf\text{-}idf $$

Conclusion

PageRank is an iterative link-analysis algorithm for calculating the relative importance of pages. The most popular version is the one that uses a damping factor: $$ R(u) = \frac{1-d}{N} + d \sum\limits_{v\in B_u} \frac{R(v)}{N_v} $$ TF-IDF is another useful method for ranking pages. It makes up for the drawbacks of PageRank and is therefore often used together with it.


Reference

[1] Larry Page. The PageRank Citation Ranking: Bringing Order to the Web. January 29, 1998.
