00:00:00
What is issue 1? It is that d is large.
00:00:22
In particular, you can think of d as much, much
larger than n. So, d is the number of features
00:00:30
and n is the number of data points.
00:00:39
So, let us do some notational
simplification. This will really help us
00:00:47
in solving this issue. So, let
us start by
00:00:54
defining some matrices. The first matrix
is the matrix of the data points.
00:01:00
Let us say we have the data
set which has x1, x2, x3, ..., xn.
00:01:08
So, let us say we stack these data points as
vectors, d dimensional vectors in columns next
00:01:15
to each other. So, that is my data set itself
represented as a matrix which is a d x n matrix.
00:01:24
Now, the covariance matrix, we know, is 1 over n
times the sum over i equals 1 to n of xi xi transpose.
00:01:35
Now, I want to write this covariance matrix
in terms of this matrix that I have defined
00:01:42
here and that is possible, not too hard to
see. So, if you look at XX transpose,
00:01:50
that means you are taking a d x n
matrix and then multiplying it with an n x d matrix
00:02:00
which is the transpose of the data points
where the data points are all rows now.
00:02:05
Now, this is exactly the sum over i equals 1 to n
of xi xi transpose. So, perhaps you have
00:02:16
already seen why this is the case; if not, try
to show this. So, try it as an exercise.
00:02:27
All I am saying is that your covariance matrix
which is 1 over n sum over i equals 1 to n x i,
00:02:33
x i transpose, this summation can actually be
expressed succinctly in matrix notation as XX
00:02:40
transpose. That just implies that our covariance
matrix is just 1 over n XX transpose.
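As a quick numerical illustration of this identity, here is a minimal NumPy sketch (the array names X, C_loop, and C_matrix are just illustrative, and the toy sizes are arbitrary) that builds the covariance matrix both as the explicit sum of outer products and as (1/n) XX transpose, and confirms the two agree:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 3                      # toy sizes; in the lecture's setting d >> n
X = rng.standard_normal((d, n))  # data points stacked as columns, X is d x n

# Covariance as the explicit sum (1/n) * sum_i x_i x_i^T
C_loop = sum(np.outer(X[:, i], X[:, i]) for i in range(n)) / n

# Covariance written compactly as (1/n) X X^T
C_matrix = (X @ X.T) / n

print(np.allclose(C_loop, C_matrix))  # True: both are the same d x d matrix
```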
00:02:47
So, as a sanity check: X is d x n, X transpose is n
x d, so XX transpose is d x d, which is also the
00:02:54
dimension of our covariance matrix. Of course,
we are dividing it by n; that is needed, I mean the
00:03:01
standard definition asks you to divide by either
n or (n - 1), it does not really matter, and we will stick
00:03:06
to n at least in this course. So, now, what does
PCA do? PCA tries to find the Eigen vectors and
00:03:16
Eigen values of the covariance matrix or find the
best-fit line which happen to be Eigen vectors and
00:03:22
Eigen values of the covariance matrix.
So, let wk be the Eigen vector
00:03:37
corresponding
00:03:42
to the k’th largest Eigen value of
C and let us call this Eigenvalue
00:03:59
λk. Now, you have a matrix C and then I am saying
that you take the Eigen values of this matrix,
00:04:06
arrange them in decreasing order,
λ1 is the highest, λ2 is the second
00:04:10
highest and so on. And wk corresponds to
the Eigenvector associated with λk.
00:04:20
Now, what is the equation that an Eigenvector
satisfies? An Eigenvector is a special direction
00:04:26
for a matrix where if the matrix acts
on this vector it just scales this
00:04:31
vector by some amount. It does not change the
direction. The scaling could be negative,
00:04:36
which would reverse the vector, but
that is still
00:04:40
okay; the direction of the line does not change.
In equations, this means that Cwk = λk
00:04:46
wk. So, that is the definition of Eigen vector and
Eigen value. Now, what we want to understand is,
00:05:02
we want to find these Eigen directions
which are w1 to wk. Now, what can we say
00:05:07
about these wk’s? Where do these wk’s live?
Where do these Eigen directions live?
00:05:13
Of course, they are all d dimensional
vectors but can we say something more
00:05:18
about where these Eigenvectors live? That
is what we are attempting to find now.
00:05:24
Now, for that what we will do is we
will replace C with its definition
00:05:30
which is 1 over n sum over i equals 1 to n xi
xi transpose, applied to wk; this equals λk wk.
00:05:42
Now, what I want to do is basically
algebra: I will keep this wk on the left-hand
00:05:51
side, bring the λk to the other side, and
combine this xi and wk together. I can,
00:05:59
essentially, bring this wk inside the sum; it
does not depend on i, so it can go inside. So,
00:06:04
wk becomes 1 over (n λk) times the sum over i
equals 1 to n of (xi transpose wk) times xi.
00:06:21
If you are not immediately seeing why these two
equations mean the same thing, pause this video,
00:06:27
just work it out. You should be
able to convince yourself that these two things
00:06:32
are exactly the same. It is just one step
of algebra (it is also written out a little further below). Now, if you stare at this
00:06:41
equation for a while, something interesting
becomes clear. So, this is worth seeing.
00:06:47
Now, actually, I can even put this
1/(n λk) inside the sum. It does not really matter;
00:06:57
I will tell you why I do this. Both forms
are exactly the same: it is a constant, I mean it
00:07:04
does not depend on i, so you can pull it inside. Now,
what is this saying? This is telling me that if I
00:07:10
want the k’th Eigen direction, so, Eigen direction
corresponding to the k’th largest Eigenvalue.
00:07:17
Now, I can express that as a summation of some
constant times xi which means I take the xi data
00:07:29
point multiplied with some number and then add up
all these scaled versions of xi. In other words,
00:07:36
this is essentially telling me that my wk
is a linear combination of data points.
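For reference, here is that one step of algebra written out, using nothing beyond the definition of C from above (and assuming λk is not zero, which is discussed next):

```latex
C w_k = \lambda_k w_k
\;\Longrightarrow\;
\frac{1}{n}\sum_{i=1}^{n} x_i x_i^{\top} w_k = \lambda_k w_k
\;\Longrightarrow\;
w_k = \sum_{i=1}^{n} \left( \frac{x_i^{\top} w_k}{n \lambda_k} \right) x_i .
```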
00:07:55
So, we are going to assume λk is not
0. So, let us do that without
00:08:01
loss of generality in this particular case,
because otherwise we would be dividing by zero
00:08:11
and the expression would blow up and would
not make sense. But for λk ≠ 0 this is true.
00:08:17
Now, the interesting thing is that, wk
is a linear combination of data points.
00:08:27
Which means that somehow you need to combine your
data points to get your Eigen directions. Now,
00:08:34
for different Eigenvectors the way you combine
these data points are going to be different.
00:08:40
Now, what we care about in PCA is to get
these Eigen directions which means now we
00:08:47
can say that equivalently we can care about
getting these combinations of these data
00:08:53
points which give these Eigenvectors.
Because once I have the Eigen vectors then
00:08:57
I know how to get a compressed representation and
all that, I can project each data point on to the
00:09:02
Eigenvectors we know that. We know what to do once
we have wk’s. But now, to get these wk’s itself
00:09:08
what we are saying is that it suffices to find
that combination of these data points that will
00:09:13
give me the wk’s. So, what does that mean?
That means that we can say our wk equals our
00:09:21
matrix X, which is, remember, the matrix of the data
points stacked in columns, multiplied by some αk,
00:09:30
for some αk in Rn. Now, what does this
mean? This simply means that, remember,
00:09:40
our matrix X is like this:
x1 to xn stacked as columns.
00:09:46
Now, I am saying there is some αk which is in Rn
which means that it gives some weight to each of
00:09:53
these data points. The subscript k says that
these weights correspond to the Eigenvector wk.
00:10:00
So, the weights are αk1 to αkn. Now, if I multiply this out, it
is just the sum over i equals 1 to n of αki xi.
00:10:14
Of course, we know what this αki is. So, this
equation tells us that these weights are exactly
00:10:20
xi transpose wk divided by n λk. But then to get these
weights from this equation, it appears that you
00:10:26
already need to know wk. Since αki is xi transpose
wk by n λk, you need to know wk to find αk.
00:10:34
That is a chicken and egg problem. We cannot
use this equation to directly find these weights
00:10:39
because this equation needs what we are trying
to find in the first place which is wk.
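Just as a numerical sanity check of this relationship (not as a way around the chicken-and-egg problem), here is a small NumPy sketch with illustrative names and toy sizes: it computes wk directly with an eigensolver, forms αk from the formula above, and confirms that X αk reproduces wk.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 6, 4
X = rng.standard_normal((d, n))      # data as columns, d x n
C = (X @ X.T) / n                    # covariance matrix, d x d

# Eigen decomposition of C (eigh returns eigenvalues in ascending order)
eigvals, eigvecs = np.linalg.eigh(C)
lam_k = eigvals[-1]                  # largest eigenvalue (non-zero here)
w_k = eigvecs[:, -1]                 # corresponding unit-length eigenvector

# Weights alpha_ki = x_i^T w_k / (n * lambda_k), one weight per data point
alpha_k = (X.T @ w_k) / (n * lam_k)  # vector in R^n

# w_k is recovered as the linear combination X @ alpha_k
print(np.allclose(X @ alpha_k, w_k))  # True
```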
00:10:44
So, let us leave this equation aside. We are
asking here whether there is a different way
00:10:48
that we can somehow find these αk's. We first
recognize that there are these αk's, which are
00:10:55
exactly this expression, but then let us forget
that for the moment. Is there a
00:11:00
different way we can somehow get these α's?
Because these weights, that is, how I should weigh
00:11:06
these data points to combine them to get my
wk, are all I need to get my wk. So,
00:11:12
somehow if I can get these α's then I am
done. And there is an αk for each k. So,
00:11:18
remember that. Because for each
wk there is a different set of
00:11:22
weights. You have to combine the data
points differently to get our wk’s.
00:11:29
Now, how can we get these
αk's? How to get the
00:11:41
αk's, this is the question. If we can somehow
efficiently get the αk's, in a way which
00:11:48
does not require order of d cubed computation, then
we are in business. Because the whole point was
00:11:54
our Eigen vector solvers are going to take
order of d cubed, where d is the dimension,
00:11:59
that is, the number of features.
If we can somehow get α without spending d cubed
00:12:05
time, then that is a good thing to do. So, how can
we do that is the question. So, we will do some
00:12:12
algebra here. I will do the algebra mostly
for completeness' sake, but what comes out of it
00:12:20
is more important. I would definitely suggest
you try and follow the algebra as we do it.
00:12:26
So, we have wk=Xαk. We know that. We do not know
what α is but we know that there will exist some
00:12:35
α, which is what we are trying to find. So, let
us keep this here. So, we know C wk = λk wk. That
00:12:46
is by definition of the Eigenvector, wk is an
Eigenvector. It has to satisfy this. We wrote
00:12:53
C as this, 1 by n X X transpose. This is something
that we wrote earlier. We also are saying wk=Xαk.
00:13:06
That is what we observed just a
minute ago. So, the right-hand side becomes λk Xαk.
00:13:16
Let me bring the n to the other
side and write this as X X transpose
00:13:21
times X αk equals n λk X αk. Now, what I would
do at this step is pre-multiply this equation
00:13:38
by X transpose. In other words, I am
saying, I will do X transpose times
00:13:47
whatever was there on the left equals X transpose times
whatever was there on the right. And what was there is X
00:13:53
X transpose X αk on the left and X αk on the right; we will see why
this is useful in a minute.
00:14:01
Now, if I rearrange terms, I can do this.
I mean, matrix multiplication is not
00:14:08
always commutative, so I cannot swap terms, but
I can change the brackets, because it is associative.
00:14:15
So, I can change the brackets however I wish. In
other words, on the left I can group it as (X transpose X) times
00:14:22
(X transpose X) acting on αk; basically, I am combining these
two factors and these two factors. On the right, nλk is
00:14:30
a constant that can come outside, and the X transpose
multiplies the X, giving nλk (X transpose X) αk.
00:14:39
Now, X transpose X appears here, and it is a matrix.
Remember, X was in R
00:14:47
d x n. So, X transpose X is an (n x d) matrix
times a (d x n) matrix, so it is n x n. So,
00:14:57
that is just to keep in mind. Now, take X transpose X,
let us just give it a name, let us call it K. Then
00:15:07
this equation is K squared αk equals n λk K αk.
So, we want the αk somehow, and we are saying
00:15:24
whichever αk it is that you need to use to
combine these data points to get wk should satisfy
00:15:31
the equation K squared αk equals n λk K αk.
In other words, there is a K on both sides, so I am saying
00:15:54
that if we can find an αk that satisfies Kαk = (nλk)αk,
00:16:05
then we can multiply by K on both sides and
it would have satisfied the other equation as
00:16:09
well. So, that means all we need to
find is an αk that satisfies Kαk = (nλk)αk.
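To keep the chain of steps in one place, here it is written out (with K = X transpose X and λk assumed non-zero, as before):

```latex
C w_k = \lambda_k w_k ,\quad C = \tfrac{1}{n} X X^{\top} ,\quad w_k = X \alpha_k
\;\Longrightarrow\; \tfrac{1}{n} X X^{\top} X \alpha_k = \lambda_k X \alpha_k
\;\Longrightarrow\; X^{\top} X X^{\top} X \alpha_k = n \lambda_k \, X^{\top} X \alpha_k
\;\Longrightarrow\; K^{2} \alpha_k = (n \lambda_k)\, K \alpha_k .
```

Any αk satisfying Kαk = (nλk)αk also satisfies the last equation: just apply K to both sides.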
00:16:18
If we can do that then we are kind of done.
Now, this looks like an Eigen equation. So,
00:16:27
this looks like an Eigen equation. In fact,
this is an Eigen equation. So, basically,
00:16:31
we are saying that, if you take αk whatever this
αk is, if I apply this K matrix to this αk, it has
00:16:40
to scale this αk by nλk. So, if I can find such an
αk then I am done. So, this is an Eigen equation.
00:16:53
Now, let us say, I give you an αk. So, I
give you something and then claim
00:17:01
that it satisfies this equation.
So, let us say I give you some vector
00:17:06
u and then it satisfies this equation that
Ku=(nλk)u, let us say. Now, is this u unique?
00:17:18
No. So, now, what I can do is take 2u, and
then K(2u) will equal (nλk)(2u), so 2u also satisfies it.
00:17:27
Now, what does this tell us? It tells us that I
can scale this u by any number and
00:17:35
it will still satisfy this equation.
So, that means we need a specific way of
00:17:40
combining these data points, a specific α that
will give us wk. We know that the length of wk
00:17:49
is 1. So, we were looking for directions with
length 1. That means there should be
00:17:55
something that we can say about the length of αk
as well. So, that means
00:18:01
you cannot arbitrarily scale this u that satisfies
this equation and claim that all of these are
00:18:08
αk's. So, now, what is the right scale? That is
what we are trying to find out.
00:18:13
Let us try to find out how the length
of wk gives us an indication of what
00:18:19
is the length that we should look for,
for αk. What do we know? So, we know
00:18:30
wk = Xαk. We saw that. So, we know that wk is such a
combination, which means wk transpose wk,
00:18:40
which is the squared length, is (Xαk) transpose (Xαk).
Now, this is αk transpose X transpose X αk.
00:18:53
We know that we are looking for wk's with
length 1, which means the left-hand side is 1.
00:18:58
On the right-hand side, this X transpose X
shows up, which is what we are calling K.
00:19:03
Which means this is αk transpose K αk.
So, now, we are saying that we need an αk that
00:19:16
satisfies the Eigen equation, but it cannot be
just any αk; such an αk should also satisfy αk
00:19:22
transpose K αk equals 1. If you find an αk that
satisfies the Eigen equation and that αk satisfies
00:19:30
αk transpose K αk equals 1, then that is the αk
that corresponds to the wk which has length 1.
00:19:40
So, all this is algebra. All that we are saying
is that we wanted wk, and we are saying we can
00:19:47
equivalently solve for αk. And solving for αk
looks like solving an Eigen equation in K.
00:19:55
But then the Eigenvector's length is unspecified.
So, we need to normalize the length, and
00:20:02
the fact that wk is of length 1 implies that αk
transpose K αk should be 1. So, we should look
00:20:09
for αk's that solve the Eigen equation but also
have this property. So, that is the first point.
00:20:19
So, now,
we need one more important
00:20:26
theorem from linear algebra which will actually
be very very useful in understanding, in solving,
00:20:33
in completing this problem. And that
fact, I mean this is a linear algebra fact
00:20:43
or actually a theorem but I
will state it as a fact for now.
00:20:52
And we will see why this fact is useful.
00:20:58
So, essentially, before I state this fact,
let me say why we need it.
00:21:05
Essentially, we wanted the Eigenvectors of
XX transpose, but now we are saying we can
00:21:12
solve the Eigen equation of K, where K is just
X transpose X. Now, somehow this also involves
00:21:21
λk. We need to know λk, which comes from the Eigen
values of XX transpose. But then we are only
00:21:29
solving the Eigen equation for X transpose X.
So, now, is there a relation between the Eigen
00:21:36
values of XX transpose and X transpose X?
Because if we solve the Eigen equation for K,
00:21:43
that will give you a set of Eigen vectors and
Eigen values; how are these Eigen values related
00:21:48
to the Eigen values that you would have gotten had
you solved the Eigen equation for XX transpose?
00:21:53
That is the question we are asking.
And the answer is this linear algebra fact.
00:22:00
The non-zero Eigenvalues of XX transpose
and X transpose X are exactly the same.
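Here is a quick numerical illustration of this fact, as a hedged sketch with arbitrary toy sizes: the non-zero eigenvalues of XX transpose (d x d) and X transpose X (n x n) coincide, while the larger matrix simply pads the spectrum with zeros.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 7, 3
X = rng.standard_normal((d, n))

# Eigenvalues of the d x d matrix X X^T and of the n x n matrix X^T X
ev_big = np.linalg.eigvalsh(X @ X.T)    # d eigenvalues, mostly (near) zero here
ev_small = np.linalg.eigvalsh(X.T @ X)  # n eigenvalues

# Compare the non-zero parts, sorted in descending order
tol = 1e-10
print(np.sort(ev_big[ev_big > tol])[::-1])
print(np.sort(ev_small[ev_small > tol])[::-1])  # same values as above
```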
00:22:22
So, now, to make it precise: XX transpose
is a matrix in d x d, and X
00:22:26
transpose X is a matrix in n x n. But because
both of these come from the underlying X, which
00:22:37
is d x n, their Eigen values are related.
In fact, if you are aware of the singular value
00:22:43
decomposition of matrices then you can simply use
that to prove this in two steps. I will not do
00:22:49
that. Feel free to try this. But what I am saying
is that the non-zero Eigenvalues, there might be
00:22:56
0 Eigen values because this is d x d, this is
n x n, so, the maximum number of non-zero Eigen
00:23:01
values of these matrices will be the minimum of
d and n. That is again a linear algebra fact.
00:23:09
So, there are going to be a lot of zero Eigen
values if d is large, but these will not
00:23:15
have counterparts among the n Eigen values. So, if n is
smaller than d, and if your matrix has full
00:23:21
rank, then your number of non-zero Eigen
values will be n and they will match with the
00:23:28
top n Eigen values of XX transpose, which is a
d x d matrix. So, what does this essentially
00:23:37
tell us? So, what is all of this telling
us now? It is telling us the following.
00:23:42
And this is the most important thing.
We will now try to put everything together
00:23:48
and see all of these together. So, we
initially had C which was 1/n XX transpose
00:23:57
whose Eigen vectors and Eigen values we
wanted. So, let us say the Eigenvectors of C
00:24:05
are w1 to wl, for some l, which we want. So,
now, for all k from 1 to l we know the length of wk
00:24:18
is 1.
Now, the Eigen values corresponding to these are
00:24:27
λ1 and so on; in fact, they are in decreasing order, λ1 > λ2
> ... > λl. So, this is what we wanted to get. Now,
00:24:42
Now, let us look at XX transpose. XX transpose
is just nC. Now, what will be the Eigen vectors
00:24:51
of length 1 for XX transpose?
They will exactly be w1
00:24:57
to wl again. The matrix is just scaled,
so the Eigen values will scale accordingly,
00:25:02
but the Eigenvectors of length 1 will stay the
same. The Eigen values will be nλ1 > nλ2 > ... > nλl.
00:25:14
So, we need these w1 to wl somehow.
Now, what are we saying? We are saying
00:25:20
we need to solve an Eigen equation of X
transpose X. This is the matrix which we
00:25:29
care about. Now, let us say the Eigenvectors of
X transpose X happen to be some vectors β1 to
00:25:42
βl, some Eigenvectors. Now, what do we know? We
know that each of these Eigenvectors has length 1.
00:25:50
All of these guys are of length 1 and let us say,
now, what are the corresponding Eigen values.
00:25:57
Now, the fact, linear algebra fact that I
stated before says that XX transpose and X
00:26:02
transpose X have exactly the same
non-zero Eigenvalues. Which means
00:26:06
the Eigen values of X transpose
X will also be nλ1 > nλ2 > ... > nλl.
00:26:19
So, this we have. So, now, what does
this mean? This means that X transpose X is exactly our K.
00:26:27
So, that means Kβk = (nλk)βk. So, we
have found an Eigen vector βk which satisfies
00:26:43
Kβk = (nλk)βk. Now, the question is what we wanted. We
wanted some αk which satisfies Kαk = (nλk)αk.
00:26:56
Now, here we have found a βk which
satisfies Kβk = (nλk)βk.
00:27:05
But is βk = αk? Can we solve the Eigen
equation like this and say that βk is the same as αk?
00:27:14
Well, we saw that there is a length constraint
also on these αk's. It is not just that
00:27:20
it has to satisfy this equation; it also has
to satisfy αk transpose K αk equals 1.
00:27:26
So, now, let us check with βk.
What is βk transpose K βk,
00:27:35
what is this value? If βk were αk then this has
to be 1. But what is it? By virtue of the
00:27:42
fact that βk is an Eigen vector of K, Kβk is
(nλk)βk. So, this is βk transpose times nλk βk,
00:27:54
which is nλk times βk transpose βk.
But we know βk is of length 1. So,
00:28:02
this is just nλk. So, βk transpose K βk is actually nλk.
But then, if we said βk is αk, it
00:28:13
is not going to work, because then
αk transpose K αk will no longer be 1. So, there is a
00:28:19
scaling of nλk that is happening.
So, that means we can set αk as
00:28:26
βk divided by the square root of nλk. If you
do this, once I set this for all k,
00:28:42
then Kαk = (nλk)αk still holds, and αk transpose K αk will
also be 1. Why? Because αk transpose K αk is
00:29:01
simply βk transpose K βk divided by nλk (the
square root gets multiplied in twice, giving nλk).
00:29:09
But we know the numerator is nλk, that is what we
argued here, so divided by nλk this is just 1.
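Written out, the normalization check is just this (a restatement of the calculation above, using the fact that βk has unit length):

```latex
\alpha_k = \frac{\beta_k}{\sqrt{n \lambda_k}}
\;\Longrightarrow\;
\alpha_k^{\top} K \alpha_k
= \frac{\beta_k^{\top} K \beta_k}{n \lambda_k}
= \frac{n \lambda_k \, \beta_k^{\top} \beta_k}{n \lambda_k}
= 1 ,
\qquad
K \alpha_k = \frac{K \beta_k}{\sqrt{n \lambda_k}}
= (n \lambda_k)\,\frac{\beta_k}{\sqrt{n \lambda_k}}
= (n \lambda_k)\,\alpha_k .
```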
00:29:17
So, now, this is kind of telling us, how to
convert your Eigen solutions for this matrix
00:29:26
K into the αk that we really needed. We will
talk about why this is a useful thing to do. So,
00:29:33
we are kind of going in circles and trying
to find αk but then I will tell you why
00:29:37
this is a good thing to do in a minute.
So, now, here is the algorithm that
00:29:42
we are proposing. So, our input is just the
data set D, which is {x1, ..., xn}, where all
00:29:52
the xi's are in Rd and the assumption is
that d is much much larger than n. So,
00:29:58
this is the setup that we are in if we have too
many features. That is where the time complexity
00:30:01
is a problem. Now, what are we saying?
We are saying that step 1, earlier, we would have
00:30:09
computed the covariance matrix and then we would
have computed the Eigenvectors and Eigenvalues.
00:30:14
We no longer want to do that because it is
expensive. So, what is the step 1? Step 1 compute
00:30:21
K= X transpose X. So, K is in Rn x n. Step
2. Compute Eigen decomposition of K.
00:30:40
Now, this is still an Eigen decomposition. Well
we wanted to avoid a costly Eigen decomposition
00:30:48
but then we are saying in step 2, we are doing an
Eigen decomposition of K. But note that K is an n
00:30:54
x n matrix. Which means the Eigen decomposition
of K is only going to take order of n cubed.
00:31:00
And if d is much much larger than n, essentially,
what you are saying is that instead of doing a d
00:31:06
cubed Eigen decomposition you can do an n cubed
Eigen decomposition. So, which might be much
00:31:12
cheaper, in general. So, we do Eigen decomposition
of K, and we get Eigen vectors β1 to βl
00:31:22
with corresponding Eigen
values nλ1, ..., nλl.
00:31:44
Once you have done this, so, this
is an order of n cubed computation.
00:31:52
So, once this is done, then we are almost
done. So, step 3. We know that these βk's are
00:31:58
not exactly the αk's, but you also have the Eigen
values with them. So, what you can do is set αk
00:32:12
equal to βk divided by the square root of nλk,
for all k equals 1 to l. That is it.
00:32:26
So, once you have this then you can get
back your w's if you want. So, wk=Xαk.
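Putting the three steps together, here is a minimal NumPy sketch of the whole procedure (a sketch under the assumptions of this lecture, with illustrative variable names and toy sizes; it is not any particular library's API). It also checks the result against the direct eigen decomposition of the covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 100, 10                     # d >> n, the regime where this trick helps
X = rng.standard_normal((d, n))    # data points stacked as columns, d x n
l = 3                              # number of principal directions we want

# Step 1: K = X^T X, an n x n matrix (cheap compared to the d x d covariance)
K = X.T @ X

# Step 2: eigen decomposition of K; eigh gives eigenvalues in ascending order
evals, evecs = np.linalg.eigh(K)
idx = np.argsort(evals)[::-1][:l]  # indices of the top-l eigenvalues n*lambda_k
n_lams = evals[idx]
betas = evecs[:, idx]              # unit-length eigenvectors beta_1, ..., beta_l

# Step 3: alpha_k = beta_k / sqrt(n * lambda_k), then w_k = X alpha_k
alphas = betas / np.sqrt(n_lams)   # scale each column of betas
W = X @ alphas                     # columns are w_1, ..., w_l, each of length 1

# Check against the direct (order d^3) route through the covariance matrix
C = (X @ X.T) / n
C_evals, C_evecs = np.linalg.eigh(C)
W_direct = C_evecs[:, np.argsort(C_evals)[::-1][:l]]

# Eigenvectors are defined only up to sign, so compare up to a sign flip per column
signs = np.sign(np.sum(W * W_direct, axis=0))
print(np.allclose(W * signs, W_direct))  # True (up to numerical tolerance)
```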
00:32:40
So, essentially, what we have done is we
have gone a different route to find
00:32:47
our wk's. Instead of directly solving
the Eigen decomposition of the covariance matrix, we are working with
00:32:52
a different, related matrix, the
Eigen decomposition of which is cheaper.
00:32:56
And then we are getting the Eigenvectors and
then converting that into the weights that you
00:33:02
need to combine the data points to get the
Eigen directions of the covariance matrix.
00:33:07
Now, this is pretty much solving problem
number 1, issue number 1. Because we are
00:33:13
not working with a huge d x d matrix, but
we are working with a smaller n x n matrix.
00:33:19
Of course, this is helpful only if d is much
much larger than n. If n is larger than d,
00:33:24
you might as well do the covariance matrix
decomposition directly. But then we are in that setup
00:33:29
where we are assuming that d is much much larger
than n and then we are trying to do this. So, this
00:33:36
solves issue 1, the time complexity issue.
Now, note that you cannot get
00:33:44
away from the Eigen decomposition completely.
So, you still have to do one. But because there
00:33:49
are only two important parameters in this setup,
one is d and one is n, we are just trying to make it
00:33:55
as simple as possible by working with the smaller
of these two. So, that is all for issue 1.
00:34:02
Now, we want to talk about issue 2 in
the next video. But before going there,
00:34:09
I would want to make one observation which I
will again make in the next video but then I
00:34:14
want to hint at it at this point and then we will
come back and connect it to the next video.
00:34:19
Now, what are we saying here,
we are given this data set,
00:34:22
the first step was to compute K as X
transpose X. So, what is Kij in that case?
00:34:31
Kij is xi transpose xj. You can verify that
this is what Kij would be, for all i, j.
00:34:41
In some sense, this is capturing the similarity
between xi and xj, in the dot product sense.
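As a small illustration (the names are just illustrative), the matrix K can be built from nothing but a pairwise similarity function; here that function is the plain dot product, which is all that the computation above turns out to need:

```python
import numpy as np

def similarity(xi, xj):
    # Here the similarity is the plain dot product x_i^T x_j;
    # the point hinted at for the next video is that only this pairwise quantity matters.
    return float(xi @ xj)

rng = np.random.default_rng(4)
d, n = 50, 5
X = rng.standard_normal((d, n))

# Build K entry by entry from pairwise similarities only
K = np.array([[similarity(X[:, i], X[:, j]) for j in range(n)] for i in range(n)])

print(np.allclose(K, X.T @ X))  # True: same matrix as X^T X
```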
00:34:51
More importantly, this also tells
us that to solve the PCA problem,
00:34:59
you only need these pairwise dot
products between the data points.
00:35:06
If you have those, you have all the information
that you need to compute what PCA requires.
00:35:13
So, whatever you want in PCA. We will make
this more precise next time and then we will
00:35:19
see that this important observation, the fact
that you only need this kind of similarity
00:35:24
between these data points and not necessarily
the data points themselves always plays a very
00:35:31
important role in trying to use the solution of
issue 1 to actually solve for issue 2 which is a
00:35:40
totally different question. It says that what
if the data points are not linearly related.
00:35:46
It needs some creative thought to take
this solution and then solve issue 2
00:35:52
and that is, in fact, one of the very
important ideas in classical machine learning
00:35:58
and we will talk about that in the next video.
For now, I would want to summarize saying that all
00:36:03
we have seen today in this set of videos is
how we have identified some issues with PCA
00:36:12
and then how you can solve the issue of time
complexity by converting your original problem
00:36:19
into a simpler problem in a computational sense
and solving the simpler problem will help you
00:36:25
solve the original problem. We will talk more
about issue 2 in the next video. Thank you.