00:00:00
What is issue 1? It is that d is large.
00:00:22
In particular, you can think of d as much, much
larger than n. So, d is the number of features
00:00:30
and n is the number of data points.
00:00:39
So, let us do some notational
simplification. This will really help us
00:00:47
in solving this issue. So, let
us start by
00:00:54
defining some matrices. The first matrix
is the matrix of the data points.
00:01:00
Let us say we have the data
set which has x1, x2, x3, ..., xn.
00:01:08
So, let us say we stack these data points as
vectors, d dimensional vectors in columns next
00:01:15
to each other. So, that is my data set itself
represented as a matrix which is a d x n matrix.
00:01:24
Now, the covariance matrix, we know, is 1 over n
times the sum over i equals 1 to n of xi xi transpose.
00:01:35
Now, I want to write this covariance matrix
in terms of this matrix that I have defined
00:01:42
here and that is possible, not too hard to
see. So, if you look at XX transpose,
00:01:50
that means you are taking a d x n
matrix and then multiplying it with an n x d matrix
00:02:00
which is the transpose of the data points
where the data points are all rows now.
00:02:05
Now, this is exactly the sum over i equals 1 to n
of xi xi transpose. So, perhaps you have
00:02:16
already seen why this is the case; if not, try
to show this. So, try it as an exercise.
00:02:27
All I am saying is that your covariance matrix
which is 1 over n sum over i equals 1 to n x i,
00:02:33
x i transpose, this summation can actually be
expressed succinctly in matrix notation as XX
00:02:40
transpose. That just implies that our covariance
matrix is just 1 over n XX transpose.
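As a quick numerical illustration of this identity, here is a minimal NumPy sketch (the array names X, C_loop, and C_matrix are just illustrative, and the toy sizes are arbitrary) that builds the covariance matrix both as the explicit sum of outer products and as (1/n) XX transpose, and confirms the two agree:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 3                      # toy sizes; in the lecture's setting d >> n
X = rng.standard_normal((d, n))  # data points stacked as columns, X is d x n

# Covariance as the explicit sum (1/n) * sum_i x_i x_i^T
C_loop = sum(np.outer(X[:, i], X[:, i]) for i in range(n)) / n

# Covariance written compactly as (1/n) X X^T
C_matrix = (X @ X.T) / n

print(np.allclose(C_loop, C_matrix))  # True: both are the same d x d matrix
```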
00:02:47
So, as a sanity check: X is d x n, X transpose is n
x d, so XX transpose is d x d, which is also the
00:02:54
dimension of our covariance matrix. Of course,
we are dividing it by n; that is needed, I mean the
00:03:01
standard definition asks you to divide by either
n or (n - 1), it does not really matter, and we will stick
00:03:06
to n at least in this course. So, now, what does
PCA do? PCA tries to find the Eigen vectors and
00:03:16
Eigen values of the covariance matrix or find the
best-fit line which happen to be Eigen vectors and
00:03:22
Eigen values of the covariance matrix.
So, let wk be the Eigen vector
00:03:37
corresponding
00:03:42
to the k’th largest Eigen value of
C and let us call this Eigenvalue
00:03:59
λk. Now, you have a matrix C and then I am saying
that you take the Eigen values of this matrix,
00:04:06
arrange them in decreasing order,
λ1 is the highest, λ2 is the second
00:04:10
highest and so on. And wk corresponds to
the Eigenvector associated with λk.
00:04:20
Now, what is the equation that an Eigenvector
satisfies? An Eigenvector is a special direction
00:04:26
for a matrix where if the matrix acts
on this vector it just scales this
00:04:31
vector by some amount. It does not change the
direction. The scaling could be negative,
00:04:36
which would reverse the vector, but
that is still
00:04:40
okay; the direction of the line does not change.
In equations, this means that Cwk = λk
00:04:46
wk. So, that is the definition of Eigen vector and
Eigen value. Now, what we want to understand is,
00:05:02
we want to find these Eigen directions
which are w1 to wk. Now, what can we say
00:05:07
about these wk’s? Where do these wk’s live?
Where do these Eigen directions live?
00:05:13
Of course, they are all d dimensional
vectors but can we say something more
00:05:18
about where these Eigenvectors live? That
is what we are attempting to find now.
00:05:24
Now, for that what we will do is we
will replace C with its definition
00:05:30
which is 1 over n sum over i equals 1 to n xi
xi transpose, applied to wk; this equals λk wk.
00:05:42
Now, what I want to do is basically
algebra: I will keep this wk on the left-hand
00:05:51
side, bring the λk to the other side, and
combine this xi and wk together. I can,
00:05:59
essentially, bring this wk inside the sum; it
does not depend on i, so it can go inside. So,
00:06:04
wk becomes 1 over (n λk) times the sum over i
equals 1 to n of (xi transpose wk) times xi.
00:06:21
If you are not immediately seeing why these two
equations mean the same thing, pause this video,
00:06:27
just work it out. You should be
able to convince yourself that these two things
00:06:32
are exactly the same. It is just one step
of algebra (it is also written out a little further below). Now, if you stare at this
00:06:41
equation for a while, something interesting
becomes clear. So, this is worth seeing.
00:06:47
Now, actually, I can even put this
1/(n λk) inside the sum. It does not really matter;
00:06:57
I will tell you why I do this. Both forms
are exactly the same: it is a constant, I mean it
00:07:04
does not depend on i, so you can pull it inside. Now,
what is this saying? This is telling me that if I
00:07:10
want the k’th Eigen direction, so, Eigen direction
corresponding to the k’th largest Eigenvalue.
00:07:17
Now, I can express that as a summation of some
constant times xi which means I take the xi data
00:07:29
point multiplied with some number and then add up
all these scaled versions of xi. In other words,
00:07:36
this is essentially telling me that my wk
is a linear combination of data points.
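For reference, here is that one step of algebra written out, using nothing beyond the definition of C from above (and assuming λk is not zero, which is discussed next):

```latex
C w_k = \lambda_k w_k
\;\Longrightarrow\;
\frac{1}{n}\sum_{i=1}^{n} x_i x_i^{\top} w_k = \lambda_k w_k
\;\Longrightarrow\;
w_k = \sum_{i=1}^{n} \left( \frac{x_i^{\top} w_k}{n \lambda_k} \right) x_i .
```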
00:07:55
So, we are going to assume λk is not
0. So, let us do that without
00:08:01
loss of generality in this particular case,
because otherwise we would be dividing by zero
00:08:11
and the expression would blow up and would
not make sense. But for λk ≠ 0 this is true.
00:08:17
Now, the interesting thing is that, wk
is a linear combination of data points.
00:08:27
Which means that somehow you need to combine your
data points to get your Eigen directions. Now,
00:08:34
for different Eigenvectors the way you combine
these data points are going to be different.
00:08:40
Now, what we care about in PCA is to get
these Eigen directions which means now we
00:08:47
can say that equivalently we can care about
getting these combinations of these data
00:08:53
points which give these Eigenvectors.
Because once I have the Eigen vectors then
00:08:57
I know how to get a compressed representation and
all that, I can project each data point on to the
00:09:02
Eigenvectors we know that. We know what to do once
we have wk’s. But now, to get these wk’s itself
00:09:08
what we are saying is that it suffices to find
that combination of these data points that will
00:09:13
give me the wk’s. So, what does that mean?
That means that we can say our wk equals our
00:09:21
matrix X, which is, remember, the matrix of the data
points stacked in columns, multiplied by some αk,
00:09:30
for some αk in Rn. Now, what does this
mean? This simply means that, remember,
00:09:40
our matrix X is like this:
x1 to xn stacked as columns.
00:09:46
Now, I am saying there is some αk which is in Rn
which means that it gives some weight to each of
00:09:53
these data points. The subscript k says that
these weights correspond to the Eigenvector wk.
00:10:00
So, the weights are αk1 to αkn. Now, if I multiply this out, it
is just the sum over i equals 1 to n of αki xi.
00:10:14
Of course, we know what this αki is. So, this
equation tells us that these weights are exactly
00:10:20
xi transpose wk divided by n λk. But then to get these
weights from this equation, it appears that you
00:10:26
already need to know wk. Since αki is xi transpose
wk by n λk, you need to know wk to find αk.
00:10:34
That is a chicken and egg problem. We cannot
use this equation to directly find these weights
00:10:39
because this equation needs what we are trying
to find in the first place which is wk.
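Just as a numerical sanity check of this relationship (not as a way around the chicken-and-egg problem), here is a small NumPy sketch with illustrative names and toy sizes: it computes wk directly with an eigensolver, forms αk from the formula above, and confirms that X αk reproduces wk.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 6, 4
X = rng.standard_normal((d, n))      # data as columns, d x n
C = (X @ X.T) / n                    # covariance matrix, d x d

# Eigen decomposition of C (eigh returns eigenvalues in ascending order)
eigvals, eigvecs = np.linalg.eigh(C)
lam_k = eigvals[-1]                  # largest eigenvalue (non-zero here)
w_k = eigvecs[:, -1]                 # corresponding unit-length eigenvector

# Weights alpha_ki = x_i^T w_k / (n * lambda_k), one weight per data point
alpha_k = (X.T @ w_k) / (n * lam_k)  # vector in R^n

# w_k is recovered as the linear combination X @ alpha_k
print(np.allclose(X @ alpha_k, w_k))  # True
```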
00:10:44
So, let us leave this equation aside. We are
asking here whether there is a different way
00:10:48
that we can somehow find these αk's. We first
recognize that there are these αk's, which are
00:10:55
exactly this expression, but then let us forget
that for the moment. Is there a
00:11:00
different way we can somehow get these α's?
Because these weights, that is, how I should weigh
00:11:06
these data points to combine them to get my
wk, are all I need to get my wk. So,
00:11:12
somehow if I can get these α's then I am
done. And there is an αk for each k. So,
00:11:18
remember that. Because for each
wk there is a different set of
00:11:22
weights. You have to combine the data
points differently to get our wk’s.
00:11:29
Now, how can we get these
αk's? How to get the
00:11:41
αk's, this is the question. If we can somehow
efficiently get the αk's, in a way which
00:11:48
does not require order of d cubed computation, then
we are in business. Because the whole point was
00:11:54
our Eigen vector solvers are going to take
order of d cubed, where d is the dimension,
00:11:59
that is, the number of features.
If we can somehow get α without spending d cubed
00:12:05
time, then that is a good thing to do. So, how can
we do that is the question. So, we will do some
00:12:12
algebra here. I will do the algebra mostly
for completeness' sake, but what comes out of it
00:12:20
is more important. I would definitely suggest
you try and follow the algebra as we do it.
00:12:26
So, we have wk=Xαk. We know that. We do not know
what α is but we know that there will exist some
00:12:35
α, which is what we are trying to find. So, let
us keep this here. So, we know C wk = λk wk. That
00:12:46
is by definition of the Eigenvector, wk is an
Eigenvector. It has to satisfy this. We wrote
00:12:53
C as this, 1 by n X X transpose. This is something
that we wrote earlier. We also are saying wk=Xαk.
00:13:06
That is what we observed just a
minute ago. So, the right-hand side becomes λk Xαk.
00:13:16
Let me bring the n to the other
side and write this as X X transpose
00:13:21
times X αk equals n λk X αk. Now, what I would
do at this step is pre-multiply this equation
00:13:38
by X transpose. In other words, I am
saying, I will do X transpose times
00:13:47
whatever was there on the left equals X transpose times
whatever was there on the right. And what was there is X
00:13:53
X transpose X αk on the left and X αk on the right; we will see why
this is useful in a minute.
00:14:01
Now, if I rearrange terms, I can do this.
I mean, matrix multiplication is not
00:14:08
always commutative, so I cannot swap terms, but
I can change the brackets, because it is associative.
00:14:15
So, I can change the brackets however I wish. In
other words, on the left I can group it as (X transpose X) times
00:14:22
(X transpose X) acting on αk; basically, I am combining these
two factors and these two factors. On the right, nλk is
00:14:30
a constant that can come outside, and the X transpose
multiplies the X, giving nλk (X transpose X) αk.
00:14:39
Now, X transpose X appears here, and it is a matrix.
Remember, X was in R
00:14:47
d x n. So, X transpose X is an (n x d) matrix
times a (d x n) matrix, so it is n x n. So,
00:14:57
that is just to keep in mind. Now, take X transpose X,
let us just give it a name, let us call it K. Then
00:15:07
this equation is K squared αk equals n λk K αk.
So, we want the αk somehow, and we are saying
00:15:24
whichever αk it is that you need to use to
combine these data points to get wk should satisfy
00:15:31
the equation K squared αk equals n λk K αk.
In other words, there is a K on both sides, so I am saying
00:15:54
that if we can find an αk that satisfies Kαk = (nλk)αk,
00:16:05
then we can multiply by K on both sides and
it would have satisfied the other equation as
00:16:09
well. So, that means all we need to
find is an αk that satisfies Kαk = (nλk)αk.
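To keep the chain of steps in one place, here it is written out (with K = X transpose X and λk assumed non-zero, as before):

```latex
C w_k = \lambda_k w_k ,\quad C = \tfrac{1}{n} X X^{\top} ,\quad w_k = X \alpha_k
\;\Longrightarrow\; \tfrac{1}{n} X X^{\top} X \alpha_k = \lambda_k X \alpha_k
\;\Longrightarrow\; X^{\top} X X^{\top} X \alpha_k = n \lambda_k \, X^{\top} X \alpha_k
\;\Longrightarrow\; K^{2} \alpha_k = (n \lambda_k)\, K \alpha_k .
```

Any αk satisfying Kαk = (nλk)αk also satisfies the last equation: just apply K to both sides.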
00:16:18
If we can do that then we are kind of done.
Now, this looks like an Eigen equation. So,
00:16:27
this looks like an Eigen equation. In fact,
this is an Eigen equation. So, basically,
00:16:31
we are saying that, if you take αk whatever this
αk is, if I apply this K matrix to this αk, it has
00:16:40
to scale this αk by nλk. So, if I can find such an
αk then I am done. So, this is an Eigen equation.
00:16:53
Now, let us say, I give you an αk. So, I
give you something and then claim
00:17:01
that it satisfies this equation.
So, let us say I give you some vector
00:17:06
u and then it satisfies this equation that
Ku=(nλk)u, let us say. Now, is this u unique?
00:17:18
No. So, now, what I can do is take 2u, and
then K(2u) will equal (nλk)(2u), so 2u also satisfies it.
00:17:27
Now, what does this tell us? It tells us that I
can scale this u by any number and
00:17:35
it will still satisfy this equation.
So, that means we need a specific way of
00:17:40
combining these data points, a specific α that
will give us wk. We know that the length of wk
00:17:49
is 1. So, we were looking for directions with
length 1. That means there should be
00:17:55
something that we can say about the length of αk
as well. So, that means
00:18:01
you cannot arbitrarily scale this u that satisfies
this equation and claim that all of these are
00:18:08
αk's. So, now, what is the right scale? That is
what we are trying to find out.
00:18:13
Let us try to find out how the length
of wk gives us an indication of what
00:18:19
is the length that we should look for,
for αk. What do we know? So, we know
00:18:30
wk = Xαk. We saw that. So, we know that wk is such a
combination, which means wk transpose wk,
00:18:40
which is the squared length, is (Xαk) transpose (Xαk).
Now, this is αk transpose X transpose X αk.
00:18:53
We know that we are looking for wk's with
length 1, which means the left-hand side is 1.
00:18:58
On the right-hand side, this X transpose X
shows up, which is what we are calling K.
00:19:03
Which means this is αk transpose K αk.
So, now, we are saying that we need an αk that
00:19:16
satisfies the Eigen equation, but it cannot be
just any αk; such an αk should also satisfy αk
00:19:22
transpose K αk equals 1. If you find an αk that
satisfies the Eigen equation and that αk satisfies
00:19:30
αk transpose K αk equals 1, then that is the αk
that corresponds to the wk which has length 1.
00:19:40
So, all this is algebra. All that we are saying
is that we wanted wk, and we are saying we can
00:19:47
equivalently solve for αk. And solving for αk
looks like solving an Eigen equation in K.
00:19:55
But then the Eigenvector's length is unspecified.
So, we need to normalize the length, and
00:20:02
the fact that wk is of length 1 implies that αk
transpose K αk should be 1. So, we should look
00:20:09
for αk's that solve the Eigen equation but also
have this property. So, that is the first point.
00:20:19
So, now,
we need one more important
00:20:26
theorem from linear algebra which will actually
be very very useful in understanding, in solving,
00:20:33
in completing this problem. And that
fact, I mean this is a linear algebra fact
00:20:43
or actually a theorem but I
will state it as a fact for now.
00:20:52
And we will see why this fact is useful.
00:20:58
So, essentially, before I state this fact,
let me say why we need it.
00:21:05
Essentially, we wanted the Eigenvectors of
XX transpose, but now we are saying we can
00:21:12
solve the Eigen equation of K, where K is just
X transpose X. Now, somehow this also involves
00:21:21
λk. We need to know λk, which comes from the Eigen
values of XX transpose. But then we are only
00:21:29
solving the Eigen equation for X transpose X.
So, now, is there a relation between the Eigen
00:21:36
values of XX transpose and X transpose X?
Because if we solve the Eigen equation for K,
00:21:43
that will give you a set of Eigen vectors and
Eigen values; how are these Eigen values related
00:21:48
to the Eigen values that you would have gotten had
you solved the Eigen equation for XX transpose?
00:21:53
That is the question we are asking.
And the answer is this linear algebra fact.
00:22:00
The non-zero Eigenvalues of XX transpose
and X transpose X are exactly the same.
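Here is a quick numerical illustration of this fact, as a hedged sketch with arbitrary toy sizes: the non-zero eigenvalues of XX transpose (d x d) and X transpose X (n x n) coincide, while the larger matrix simply pads the spectrum with zeros.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 7, 3
X = rng.standard_normal((d, n))

# Eigenvalues of the d x d matrix X X^T and of the n x n matrix X^T X
ev_big = np.linalg.eigvalsh(X @ X.T)    # d eigenvalues, mostly (near) zero here
ev_small = np.linalg.eigvalsh(X.T @ X)  # n eigenvalues

# Compare the non-zero parts, sorted in descending order
tol = 1e-10
print(np.sort(ev_big[ev_big > tol])[::-1])
print(np.sort(ev_small[ev_small > tol])[::-1])  # same values as above
```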
00:22:22
So, now, to make it precise: XX transpose
is a matrix in d x d, and X
00:22:26
transpose X is a matrix in n x n. But because
both of these come from the underlying X, which
00:22:37
is d x n, their Eigen values are related.
In fact, if you are aware of the singular value
00:22:43
decomposition of matrices then you can simply use
that to prove this in two steps. I will not do
00:22:49
that. Feel free to try this. But what I am saying
is that the non-zero Eigenvalues, there might be
00:22:56
0 Eigen values because this is d x d, this is
n x n, so, the maximum number of non-zero Eigen
00:23:01
values of these matrices will be the minimum of
d and n. That is again a linear algebra fact.
00:23:09
So, there are going to be a lot of zero Eigen
values if d is large, but these will not
00:23:15
have counterparts among the n Eigen values. So, if n is
smaller than d, and if your matrix has full
00:23:21
rank, then your number of non-zero Eigen
values will be n and they will match with the
00:23:28
top n Eigen values of XX transpose, which is a
d x d matrix. So, what does this essentially
00:23:37
tell us? So, what is all of this telling
us now? It is telling us the following.
00:23:42
And this is the most important thing.
We will now try to put everything together
00:23:48
and see all of these together. So, we
initially had C which was 1/n XX transpose
00:23:57
whose Eigen vectors and Eigen values we
wanted. So, let us say the Eigenvectors of C
00:24:05
are w1 to wl, for some l, which we want. So,
now, for all k from 1 to l we know the length of wk
00:24:18
is 1.
Now, the Eigen values corresponding to these are
00:24:27
λ1 and so on; in fact, they are in decreasing order, λ1 > λ2
> ... > λl. So, this is what we wanted to get. Now,
00:24:42
Now, let us look at XX transpose. XX transpose
is just nC. Now, what will be the Eigen vectors
00:24:51
of length 1 for XX transpose?
They will exactly be w1
00:24:57
to wl again. The matrix is just scaled,
so the Eigen values will scale accordingly,
00:25:02
but the Eigenvectors of length 1 will stay the
same. The Eigen values will be nλ1 > nλ2 > ... > nλl.
00:25:14
So, we need these w1 to wl somehow.
Now, what are we saying? We are saying
00:25:20
we need to solve an Eigen equation of X
transpose X. This is the matrix which we
00:25:29
care about. Now, let us say the Eigenvectors of
X transpose X happen to be some vectors β1 to
00:25:42
βl, some Eigenvectors. Now, what do we know? We
know that each of these Eigenvectors has length 1.
00:25:50
All of these guys are of length 1 and let us say,
now, what are the corresponding Eigen values.
00:25:57
Now, the fact, linear algebra fact that I
stated before says that XX transpose and X
00:26:02
transpose X have exactly the same
non-zero Eigenvalues. Which means
00:26:06
the Eigen values of X transpose
X will also be nλ1 > nλ2 > ... > nλl.
00:26:19
So, this we have. So, now, what does
this mean? This means that X transpose X is exactly our K.
00:26:27
So, that means Kβk = (nλk)βk. So, we
have found an Eigen vector βk which satisfies
00:26:43
Kβk = (nλk)βk. Now, the question is what we wanted. We
wanted some αk which satisfies Kαk = (nλk)αk.
00:26:56
Now, here we have found a βk which
satisfies Kβk = (nλk)βk.
00:27:05
But is βk = αk? Can we solve the Eigen
equation like this and say that βk is the same as αk?
00:27:14
Well, we saw that there is a length constraint
also on these αk's. It is not just that
00:27:20
it has to satisfy this equation; it also has
to satisfy αk transpose K αk equals 1.
00:27:26
So, now, let us check with βk.
What is βk transpose K βk,
00:27:35
what is this value? If βk were αk then this has
to be 1. But what is it? By virtue of the
00:27:42
fact that βk is an Eigen vector of K, Kβk is
(nλk)βk. So, this is βk transpose times nλk βk,
00:27:54
which is nλk times βk transpose βk.
But we know βk is of length 1. So,
00:28:02
this is just nλk. So, βk transpose K βk is actually nλk.
But then, if we said βk is αk, it
00:28:13
is not going to work, because then
αk transpose K αk will no longer be 1. So, there is a
00:28:19
scaling of nλk that is happening.
So, that means we can set αk as
00:28:26
βk divided by the square root of nλk. If you
do this, once I set this for all k,
00:28:42
then Kαk = (nλk)αk still holds, and αk transpose K αk will
also be 1. Why? Because αk transpose K αk is
00:29:01
simply βk transpose K βk divided by nλk (the
square root gets multiplied in twice, giving nλk).
00:29:09
But we know the numerator is nλk, that is what we
argued here, so divided by nλk this is just 1.
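Written out, the normalization check is just this (a restatement of the calculation above, using the fact that βk has unit length):

```latex
\alpha_k = \frac{\beta_k}{\sqrt{n \lambda_k}}
\;\Longrightarrow\;
\alpha_k^{\top} K \alpha_k
= \frac{\beta_k^{\top} K \beta_k}{n \lambda_k}
= \frac{n \lambda_k \, \beta_k^{\top} \beta_k}{n \lambda_k}
= 1 ,
\qquad
K \alpha_k = \frac{K \beta_k}{\sqrt{n \lambda_k}}
= (n \lambda_k)\,\frac{\beta_k}{\sqrt{n \lambda_k}}
= (n \lambda_k)\,\alpha_k .
```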
00:29:17
So, now, this is kind of telling us, how to
convert your Eigen solutions for this matrix
00:29:26
K into the αk that we really needed. We will
talk about why this is a useful thing to do. So,
00:29:33
we are kind of going in circles and trying
to find αk but then I will tell you why
00:29:37
this is a good thing to do in a minute.
So, now, here is the algorithm that
00:29:42
we are proposing. So, our input is just the
data set D, which is {x1, ..., xn}, where all
00:29:52
the xi's are in Rd and the assumption is
that d is much much larger than n. So,
00:29:58
this is the setup that we are in if we have too
many features. That is where the time complexity
00:30:01
is a problem. Now, what are we saying?
We are saying that step 1, earlier, we would have
00:30:09
computed the covariance matrix and then we would
have computed the Eigenvectors and Eigenvalues.
00:30:14
We no longer want to do that because it is
expensive. So, what is the step 1? Step 1 compute
00:30:21
K= X transpose X. So, K is in Rn x n. Step
2. Compute Eigen decomposition of K.
00:30:40
Now, this is still an Eigen decomposition. Well
we wanted to avoid a costly Eigen decomposition
00:30:48
but then we are saying in step 2, we are doing an
Eigen decomposition of K. But note that K is an n
00:30:54
x n matrix. Which means the Eigen decomposition
of K is only going to take order of n cubed.
00:31:00
And if d is much much larger than n, essentially,
what you are saying is that instead of doing a d
00:31:06
cubed Eigen decomposition you can do an n cubed
Eigen decomposition. So, which might be much
00:31:12
cheaper, in general. So, we do Eigen decomposition
of K, and we get Eigen vectors β1 to βl
00:31:22
with corresponding Eigen
values nλ1, ..., nλl.
00:31:44
Once you have done this, so, this
is an order of n cubed computation.
00:31:52
So, once this is done, then we are almost
done. So, step 3. We know that these βk's are
00:31:58
not exactly the αk's, but you also have the Eigen
values with them. So, what you can do is set αk
00:32:12
equal to βk divided by the square root of nλk,
for all k equals 1 to l. That is it.
00:32:26
So, once you have this then you can get
back your w's if you want. So, wk=Xαk.
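Putting the three steps together, here is a minimal NumPy sketch of the whole procedure (a sketch under the assumptions of this lecture, with illustrative variable names and toy sizes; it is not any particular library's API). It also checks the result against the direct eigen decomposition of the covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 100, 10                     # d >> n, the regime where this trick helps
X = rng.standard_normal((d, n))    # data points stacked as columns, d x n
l = 3                              # number of principal directions we want

# Step 1: K = X^T X, an n x n matrix (cheap compared to the d x d covariance)
K = X.T @ X

# Step 2: eigen decomposition of K; eigh gives eigenvalues in ascending order
evals, evecs = np.linalg.eigh(K)
idx = np.argsort(evals)[::-1][:l]  # indices of the top-l eigenvalues n*lambda_k
n_lams = evals[idx]
betas = evecs[:, idx]              # unit-length eigenvectors beta_1, ..., beta_l

# Step 3: alpha_k = beta_k / sqrt(n * lambda_k), then w_k = X alpha_k
alphas = betas / np.sqrt(n_lams)   # scale each column of betas
W = X @ alphas                     # columns are w_1, ..., w_l, each of length 1

# Check against the direct (order d^3) route through the covariance matrix
C = (X @ X.T) / n
C_evals, C_evecs = np.linalg.eigh(C)
W_direct = C_evecs[:, np.argsort(C_evals)[::-1][:l]]

# Eigenvectors are defined only up to sign, so compare up to a sign flip per column
signs = np.sign(np.sum(W * W_direct, axis=0))
print(np.allclose(W * signs, W_direct))  # True (up to numerical tolerance)
```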
00:32:40
So, essentially, what we have done is we
have gone a different route to find
00:32:47
our wk's. Instead of directly solving
the Eigen decomposition of the covariance matrix, we are working with
00:32:52
a different, related matrix, the
Eigen decomposition of which is cheaper.
00:32:56
And then we are getting the Eigenvectors and
then converting that into the weights that you
00:33:02
need to combine the data points to get the
Eigen directions of the covariance matrix.
00:33:07
Now, this is pretty much solving problem
number 1, issue number 1. Because we are
00:33:13
not working with a huge d x d matrix, but
we are working with a smaller n x n matrix.
00:33:19
Of course, this is helpful only if d is much
much larger than n. If n is larger than d,
00:33:24
you might as well do the covariance matrix
decomposition directly. But then we are in that setup
00:33:29
where we are assuming that d is much much larger
than n and then we are trying to do this. So, this
00:33:36
solves issue 1, the time complexity issue.
Now, note that you cannot get
00:33:44
away from the Eigen decomposition completely.
So, you still have to do one. But because there
00:33:49
are only two important parameters in this setup,
one is d and one is n, we are just trying to make it
00:33:55
as simple as possible by working with the smaller
of these two. So, that is all for issue 1.
00:34:02
Now, we want to talk about issue 2 in
the next video. But before going there,
00:34:09
I would want to make one observation which I
will again make in the next video but then I
00:34:14
want to hint at it at this point and then we will
come back and connect it to the next video.
00:34:19
Now, what are we saying here,
we are given this data set,
00:34:22
the first step was to compute K as X
transpose X. So, what is Kij in that case?
00:34:31
Kij is xi transpose xj. You can verify that
this is what Kij would be, for all i, j.
00:34:41
In some sense, this is capturing the similarity
between xi and xj, in the dot product sense.
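As a small illustration (the names are just illustrative), the matrix K can be built from nothing but a pairwise similarity function; here that function is the plain dot product, which is all that the computation above turns out to need:

```python
import numpy as np

def similarity(xi, xj):
    # Here the similarity is the plain dot product x_i^T x_j;
    # the point hinted at for the next video is that only this pairwise quantity matters.
    return float(xi @ xj)

rng = np.random.default_rng(4)
d, n = 50, 5
X = rng.standard_normal((d, n))

# Build K entry by entry from pairwise similarities only
K = np.array([[similarity(X[:, i], X[:, j]) for j in range(n)] for i in range(n)])

print(np.allclose(K, X.T @ X))  # True: same matrix as X^T X
```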
00:34:51
More importantly, this also tells
us that to solve the PCA problem,
00:34:59
you only need these pairwise dot
products between the data points.
00:35:06
If you have those, you have all the information
that you need to compute what PCA requires.
00:35:13
So, whatever you want in PCA. We will make
this more precise next time and then we will
00:35:19
see that this important observation, the fact
that you only need this kind of similarity
00:35:24
between these data points and not necessarily
the data points themselves always plays a very
00:35:31
important role in trying to use the solution of
issue 1 to actually solve for issue 2 which is a
00:35:40
totally different question. It says that what
if the data points are not linearly related.
00:35:46
It needs some creative thought to take
this solution and then solve issue 2
00:35:52
and that is, in fact, one of the very
important ideas in classical machine learning
00:35:58
and we will talk about that in the next video.
For now, I would want to summarize saying that all
00:36:03
we have seen today in this set of videos is
how we have identified some issues with PCA
00:36:12
and then how you can solve the issue of time
complexity by converting your original problem
00:36:19
into a simpler problem in a computational sense
and solving the simpler problem will help you
00:36:25
solve the original problem. We will talk more
about issue 2 in the next video. Thank you.