Time-complexity issue with PCA

00:36:35
https://www.youtube.com/watch?v=pwpDf2_Pytk

Overview

TL;DR: The video addresses the challenges of Principal Component Analysis (PCA) when the number of features (d) is significantly larger than the number of data points (n). It explains how to represent data points as a matrix and derive the covariance matrix using matrix notation. The video emphasizes the importance of Eigenvalues and Eigenvectors in PCA, detailing how to compute them efficiently by focusing on the smaller n x n matrix instead of the larger d x d matrix, thus solving the time complexity issue. The video concludes by hinting at a second issue related to PCA that will be discussed in the next video.
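
To make the matrix notation concrete, here is a minimal NumPy sketch (not from the video; sizes, seed, and variable names are illustrative assumptions) checking that the sum 1/n * Σ x_i x_i^T and the matrix form (1/n) X X^T give the same covariance matrix:

```python
import numpy as np

# Illustrative sizes only: many features (d), few data points (n).
d, n = 500, 20
rng = np.random.default_rng(0)
X = rng.standard_normal((d, n))          # data points stacked as columns (d x n)

# Covariance as an explicit sum over data points (as defined in the video) ...
C_sum = sum(np.outer(X[:, i], X[:, i]) for i in range(n)) / n
# ... and the same quantity in matrix notation, (1/n) X X^T.
C_mat = (X @ X.T) / n

assert np.allclose(C_sum, C_mat)
```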

Key Takeaways

  • 📊 PCA deals with high-dimensional data where d >> n.
  • 🔍 Covariance matrix can be expressed as XX^T.
  • 🔑 Eigenvalues and Eigenvectors are crucial for PCA.
  • ⚙️ Solve time complexity by using n x n matrix K.
  • 📏 Non-zero Eigenvalues of XX^T and X^T X are the same (see the sketch after this list).
  • 🧮 Final step involves normalizing Eigenvectors.
  • 🔗 Pairwise dot products capture similarity in data.
  • 🔄 Next video will address a second issue in PCA.
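
The eigenvalue claim in the list above can be checked numerically. The following is a minimal NumPy sketch (illustrative sizes and names, not from the video) comparing the non-zero eigenvalues of XX^T and X^T X:

```python
import numpy as np

# Illustrative sizes, assuming d >> n as in the video's setting.
d, n = 500, 20
rng = np.random.default_rng(1)
X = rng.standard_normal((d, n))

# Eigenvalues of the d x d matrix X X^T and of the n x n matrix X^T X.
evals_dd = np.linalg.eigvalsh(X @ X.T)   # expensive route when d is large
evals_nn = np.linalg.eigvalsh(X.T @ X)   # cheap route: only n x n

# The non-zero eigenvalues coincide; X X^T simply has d - n extra zeros.
top_dd = np.sort(evals_dd)[::-1][:n]
top_nn = np.sort(evals_nn)[::-1][:n]
assert np.allclose(top_dd, top_nn)
```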

Timeline

  • 00:00:00 - 00:05:00

    The video introduces the issue of high dimensionality in data analysis, highlighting that the number of features (d) can greatly exceed the number of data points (n). It simplifies notation by representing the data set as a d x n matrix X whose columns are the d-dimensional data points. The video then derives the covariance matrix from this representation and relates it to PCA (Principal Component Analysis).

  • 00:05:00 - 00:10:00

    Acknowledging the need to find eigenvectors (wk) and eigenvalues (λk) from the covariance matrix, the video explains the properties of eigenvectors and introduces the concept of linear combinations of data points to derive these eigenvectors. It emphasizes that each eigenvector can be expressed as a linear combination of the data points if the eigenvalue is non-zero and explores how this relationship underpins PCA.

  • 00:10:00 - 00:15:00

    The video delves into the algebraic manipulation of the equations relating the eigenvectors and the data points. It shows how to combine the data points with specific weights (αk) to form each eigenvector (a short sketch of this identity follows the timeline). However, a challenge arises: computing these weights from the derived formula requires already knowing the eigenvectors, creating a circular dependency between the weights and the eigenvectors.

  • 00:15:00 - 00:20:00

    The discussion transitions to finding a different way to derive the weights (αk) without a costly eigenvector computation. Because the eigenvectors must have unit length, a corresponding normalization condition on the weights is established. Consequently, the video stresses finding weights that both satisfy the eigen equation and preserve the unit-length requirement for the eigenvectors.

  • 00:20:00 - 00:25:00

    The narrative shifts to exploring how to solve the eigenvector equations related to K = X^T * X, which is a smaller matrix that can be solved more efficiently. The relationship between the eigenvalues of the covariance matrix and the K matrix is discussed, establishing that they share the same non-zero eigenvalues. This insight paves the way for calculating eigenvectors of K instead of the original covariance matrix, thereby addressing the issue of computational efficiency.

  • 00:25:00 - 00:30:00

    A detailed algorithm is proposed to compute the principal components. The method forms K, computes its eigenvalues and eigenvectors, and then rescales those eigenvectors by the eigenvalues to obtain the combination weights (αk = βk / sqrt(nλk)). This sequence yields the eigenvectors of the covariance matrix at a significantly reduced computational cost when the number of features (d) is much larger than the number of data points (n).

  • 00:30:00 - 00:36:35

    The video concludes with a summary of the approach taken to overcome the computational challenges associated with PCA. It sets the stage for the next discussion on further issues, hinting at exploring how pairwise relationships between data points can simplify PCA implementations, even in cases where the data is not linearly related. The importance of understanding these relationships in machine learning contexts is underscored.
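
As referenced in the 00:10:00 - 00:15:00 entry above, each eigenvector w_k of the covariance matrix is a linear combination of the data points: w_k = X α_k with α_{k,i} = x_i^T w_k / (n λ_k). Below is a minimal NumPy sketch of that identity; sizes, seed, and variable names are illustrative assumptions, not the course's code:

```python
import numpy as np

# Illustrative check that each eigenvector of C is a linear combination
# of the data points, with the weights derived in the video:
# alpha_{k,i} = x_i^T w_k / (n * lambda_k), so that w_k = X alpha_k.
d, n = 300, 15
rng = np.random.default_rng(2)
X = rng.standard_normal((d, n))
C = (X @ X.T) / n

eigvals, eigvecs = np.linalg.eigh(C)      # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]
lam = eigvals[order]                      # lambda_1 >= lambda_2 >= ...
W = eigvecs[:, order]

k = 0                                     # top eigenvector (lambda_k != 0)
w_k = W[:, k]
alpha_k = (X.T @ w_k) / (n * lam[k])      # the weights from the video
assert np.allclose(X @ alpha_k, w_k)      # w_k really is X alpha_k
```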

Video Q&A

  • What is the main issue discussed in the video?

    The main issue is the time complexity of PCA when the number of features (d) is much larger than the number of data points (n).

  • How is the covariance matrix derived in PCA?

    The covariance matrix is derived using the formula 1/n * sum(x_i * x_i^T); the sum itself equals XX^T in matrix notation, so the covariance matrix is (1/n) XX^T.

  • What is the significance of Eigenvalues and Eigenvectors in PCA?

    Eigenvalues and Eigenvectors are used to find the best-fit line in PCA, representing the directions of maximum variance in the data.

  • How does the video propose to solve the time complexity issue in PCA?

    The video suggests computing the smaller n x n matrix K = X^T * X instead of the larger d x d covariance matrix.

  • What is the relationship between the Eigenvalues of XX^T and X^T X?

    The non-zero Eigenvalues of XX^T and X^T X are exactly the same.

  • What is the final step in the proposed algorithm for PCA?

    The final step is to set α_k = β_k / sqrt(nλ_k) to obtain the Eigenvectors corresponding to the covariance matrix.

  • What will be discussed in the next video?

    The next video will address a second issue related to PCA and the importance of pairwise dot products between data points.
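
A small NumPy sketch (illustrative, not from the video) of the last point above: the matrix K = X^T X is exactly the table of pairwise dot products between data points, K[i, j] = x_i · x_j.

```python
import numpy as np

# Illustrative sizes; K = X^T X collects every pairwise dot product
# between data points: K[i, j] = x_i . x_j.
d, n = 400, 10
rng = np.random.default_rng(3)
X = rng.standard_normal((d, n))

K = X.T @ X                               # n x n Gram matrix, built in one shot
K_pairwise = np.array([[X[:, i] @ X[:, j] for j in range(n)]
                       for i in range(n)])

assert np.allclose(K, K_pairwise)
```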

Subtitles (en)
  • 00:00:00
    What is issue 1? It is a large d.
  • 00:00:22
    In particular you can think of d is much much  larger than n. So, d is the number of features
  • 00:00:30
    and n is the number of data points.
  • 00:00:39
    So, let us do some, I mean notational  simplification. This will really help us,
  • 00:00:47
    solving this issue. So, let  us start by giving some,
  • 00:00:54
    defining some matrices. The first matrix  is the matrix of the data points.
  • 00:01:00
    Let us say we have the data set which has x1, x2, x3, ..., xn.
  • 00:01:08
    So, let us say we stack these data points as  vectors, d dimensional vectors in columns next
  • 00:01:15
    to each other. So, that is my data set itself  represented as a matrix which is a d x n matrix.
  • 00:01:24
    Now, the covariance matrix, we know is 1 over n  sum over i equals 1 to n x i x i transpose.
  • 00:01:35
    Now, I want to write this covariance matrix  in terms of this matrix that I have defined
  • 00:01:42
    here and that is possible, not too hard to  see. So, if you look at XX transpose, now,
  • 00:01:50
    which means that you are taking a d x n matrix and then multiplying it with an n x d matrix
  • 00:02:00
    which is the transpose of the data points  where the data points are all rows now.
  • 00:02:05
    Now, this is exactly sum over i equals 1 to n x i x i transpose. So, this is perhaps you have
  • 00:02:16
    already seen why this is the case; if not, try to show this. So, try to show this as an exercise.
  • 00:02:27
    All I am saying is that your covariance matrix  which is 1 over n sum over i equals 1 to n x i,
  • 00:02:33
    x i transpose, this summation can actually be  expressed succinctly in matrix notation as XX
  • 00:02:40
    transpose. That just implies that our covariance  matrix is just 1 over n XX transpose.
  • 00:02:47
    So, sanity check: X is d x n, X transpose is n x d, XX transpose is d x d which is also the
  • 00:02:54
    dimension of our covariance matrix. Of course,  we are dividing it by n that is needed I mean
  • 00:03:01
    standard definition needs you to either divide by  n or (n -1) does not really matter, we will stick
  • 00:03:06
    to n at least in this course. So, now, what does  PCA do? PCA tries to find the Eigen vectors and
  • 00:03:16
    Eigen values of the covariance matrix or find the  best-fit line which happen to be Eigen vectors and
  • 00:03:22
    Eigen values of the covariance matrix. So, let us say wk be the Eigen vector
  • 00:03:37
    corresponding
  • 00:03:42
    to the k’th largest Eigen value of  C and let us call this Eigenvalue
  • 00:03:59
    λk. Now, you have a matrix C and then I am saying  that you take the Eigen values of this matrix,
  • 00:04:06
    arrange them in decreasing order, λ1 is the highest, λ2 is the second
  • 00:04:10
    highest and so on. And wk corresponds to  the Eigenvector associated with λk.
  • 00:04:20
    Now, what is the equation that an Eigenvector satisfies? An Eigenvector is a special direction
  • 00:04:26
    for a matrix where if the matrix acts  on this vector it just scales this
  • 00:04:31
    vector by some amount. It does not change the  direction. So, the direction is either scaling,
  • 00:04:36
    it could reverse the direction. So,  scaling could be negative that is still
  • 00:04:40
    okay but it does not change the direction. So, which means in equations, it means that Cwk=λk
  • 00:04:46
    wk. So, that is the definition of Eigen vector and  Eigen value. Now, what we want to understand is,
  • 00:05:02
    we want to find these Eigen directions  which are w1 to wk. Now, what can we say
  • 00:05:07
    about these wk’s? Where do these wk’s live?  Where do these Eigen directions live?
  • 00:05:13
    Of course, they are all d dimensional  vectors but can we say something more
  • 00:05:18
    about where these Eigenvectors live? That  is what we are attempting to find now,
  • 00:05:24
    Now, for that what we will do is we  will replace C with its definition
  • 00:05:30
    which is 1 over n sum over i equals 1 to n x i x i transpose, times w k equals λ k w k.
  • 00:05:42
    Now, what I want to do is this is basically  algebra, I will retain this wk on the left-hand
  • 00:05:51
    side, bring the λk to the other side and  combine this xi and wk together. I can bring,
  • 00:05:59
    essentially, bringing that this wk inside, it  does not depend on i so, it can go inside. So,
  • 00:06:04
    this whole thing becomes 1 by n λ k sum over i  equals 1 to n x i transpose w k times x i.
  • 00:06:21
    If you are not immediately seeing why these two  equations mean the same thing, pause this video,
  • 00:06:27
    just work it out it is. I mean you should be  able to convince yourself that these two things
  • 00:06:32
    are exactly the same. It is just one-step  of algebra. Now, mean if you stare at this
  • 00:06:41
    equation for a while, something interesting  becomes clear. So, this is kind of seeing.
  • 00:06:47
    Now, let me, actually, even I can put this  in λk inside. So, it does not really matter.
  • 00:06:57
    So, I will tell you why I do this, yeah. So, both  are exactly the same. It is a constant. I mean it
  • 00:07:04
    does not depend on I, you can pull it inside. Now,  what is this saying? This is telling me that if I
  • 00:07:10
    want the k’th Eigen direction, so, Eigen direction  corresponding to the k’th largest Eigenvalue.
  • 00:07:17
    Now, I can express that as a summation of some  constant times xi which means I take the xi data
  • 00:07:29
    point multiplied with some number and then add up  all these scaled versions of xi. In other words,
  • 00:07:36
    this is essentially telling me that my wk  is a linear combination of data points.
  • 00:07:55
    So, we are going to assume λk is not  0 that. So, let us do that without
  • 00:08:01
    loss of generality in this particular k’s. So,  because otherwise this is going to not, I mean,
  • 00:08:11
    this is going to be infinity and then that would  not make sense. But for λk≠0 this is true.
  • 00:08:17
    Now, the interesting thing is that, wk  is a linear combination of data points.
  • 00:08:27
    Which means that somehow you need to combine your  data points to get your Eigen directions. Now,
  • 00:08:34
    for different Eigenvectors the way you combine  these data points are going to be different.
  • 00:08:40
    Now, what we care about in PCA is to get  these Eigen directions which means now we
  • 00:08:47
    can say that equivalently we can care about  getting these combinations of these data
  • 00:08:53
    points which give these Eigenvectors. Because once I have the Eigen vectors then
  • 00:08:57
    I know how to get a compressed representation and  all that, I can project each data point on to the
  • 00:09:02
    Eigenvectors we know that. We know what to do once  we have wk’s. But now, to get these wk’s itself
  • 00:09:08
    what we are saying is that it suffices to find  that combination of these data points that will
  • 00:09:13
    give me the wk’s. So, what does that mean? That means that we can say our wk equals our
  • 00:09:21
    matrix X, which is remember the matrix of the data  point stacked in columns multiplied by some αk
  • 00:09:30
    for some αk in Rn. Now, what does this  mean? This simply means that remember,
  • 00:09:40
    our data point is like this. So, x1 to, sorry, the matrix X is like this: x1 to xn.
  • 00:09:46
    Now, I am saying there is some αk which is in Rn  which means that it gives some weight to each of
  • 00:09:53
    this data point. The k says that it corresponds,  these weights corresponds to the Eigenvector wk.
  • 00:10:00
    So, αk1 to αkn. Now, if I multiply this, this  is just sum over i equals 1 to n α k i x i.
  • 00:10:14
    Of course, we know what this αki is. So, this  equation tells us that these weights are exactly
  • 00:10:20
    x i transpose w k by n λ k. But then to get these  weights from this equation, it appears that you
  • 00:10:26
    already need to know wk. So, this is x i transpose  w k by n λ k, you need to know wk to find w.
  • 00:10:34
    That is a chicken and egg problem. We cannot  use this equation to directly find these weights
  • 00:10:39
    because this equation needs what we are trying  to find in the first place which is wk.
  • 00:10:44
    So, let us leave this equation out. We are  saying here that is there a different way
  • 00:10:48
    that we can somehow find this αk’s. We first  recognize that there are these αk’s which
  • 00:10:55
    exactly is this but then let us forget that  it is this for the moment. But is there a
  • 00:11:00
    different way you can somehow get these αs. Because, if these weights, how should I weigh
  • 00:11:06
    these data points to combine them to get my  wk that is all I need. So, I to get my wk. So,
  • 00:11:12
    somehow if I can get these αs then I am  done. And there is an αk for each k. So,
  • 00:11:18
    remember that. Because for each  wk there is a different set of
  • 00:11:22
    weights. You have to combine the data  points differently to get our wk’s.
  • 00:11:29
    Now, how can we get these  αk’s? That is the how to get
  • 00:11:41
    αk’s, this is the question. Once we have this,  once if we can somehow efficiently get αk’s which
  • 00:11:48
    does not require order of d cubed computation then we are in business. Because the whole point was
  • 00:11:54
    our Eigen vector solvers are going to take  order of d cubed where d is the dimension,
  • 00:11:59
    d is the dimension or the number of features. If we can somehow get α without spending d cubed
  • 00:12:05
    time then that is a good thing to do. So, how can  we do that is the question. So, we will do some
  • 00:12:12
    algebra here. It is more the algebra I will do it  for completion sake but then what comes out of it
  • 00:12:20
    is more important. But I would definitely suggest  you to try and follow the algebra as we do it.
  • 00:12:26
    So, we have wk=Xαk. We know that. We do not know  what α is but we know that there will exist some
  • 00:12:35
    α which is what we are trying to find. So, let  this be right here. So, we know Cwk=λk wk. That
  • 00:12:46
    is by definition of the Eigenvector, wk is an  Eigenvector. It has to satisfy this. We wrote
  • 00:12:53
    C as this, 1 by n X X transpose. This is something  that we wrote earlier. We also are saying wk=Xαk.
  • 00:13:06
    That is what we observed in just a  minute ago. So, this is λk Xαk.
  • 00:13:16
    Let me bring the n to the other  side and write this as X X transpose
  • 00:13:21
    times X α k equals n λ k X α k. Now, what I would  do is at this step is pre-multiply this equation
  • 00:13:38
    by X transpose. In other words, I am  saying, I will do X transpose times
  • 00:13:47
    whatever was there equals X transpose times whatever was there. And what was there was X
  • 00:13:53
    X transpose X α k and we will see why  this is useful in a minute, Xαk.
  • 00:14:01
    Now, if I rearrange terms, this is I can do this,  I mean I cannot, matrix multiplication is not
  • 00:14:08
    commutative always, so, I cannot swap terms but  I can change the brackets. So, it is associative.
  • 00:14:15
    So, I can change the brackets however I wish. In  other words, I can do it as x x x transpose x into
  • 00:14:22
    x transpose x. Basically, I am combining these two guys and these two guys into αk, equals nλk, which is
  • 00:14:30
    a constant that can come outside. This x transpose  multiplies this x x transpose x into α k.
  • 00:14:39
    Now, this is a matrix, this is a matrix.  Now, x transpose x. Now, remember X was in R
  • 00:14:47
    d x n. So, x transpose x is in R n x n: x transpose is n x d and x is d x n, so, this is n x n. So,
  • 00:14:57
    that is just to remember. Now, call x transpose x.  Let us just give it a name, let us call it K. Then
  • 00:15:07
    this equation is K squared α k equals n λ k K α k. So, we want the αk somehow and we are saying
  • 00:15:24
    whichever αk that is that you need to use to  combine these data points to get wk should satisfy
  • 00:15:31
    the equation K squared α k equals n λ k K α k. In other words, if we can find αk that satisfies,
  • 00:15:54
    there is a K on both sides, so, I am saying, if we can find an αk that satisfies Kαk=(nλk)αk
  • 00:16:05
    then we can multiply by K on both sides and  it would have satisfied the other equation as
  • 00:16:09
    well. So, which means that all we need to  find is an αk that satisfies Kαk=(nλk)αk.
  • 00:16:18
    If we can do that then we are kind of done. Now, this looks like an Eigen equation. So,
  • 00:16:27
    this looks like an Eigen equation. In fact,  this is an Eigen equation. So, basically,
  • 00:16:31
    we are saying that, if you take αk whatever this  αk is, if I apply this K matrix to this αk, it has
  • 00:16:40
    to scale this αk by nλk. So, if I can find such an  αk then I am done. So, this is an Eigen equation.
  • 00:16:53
    Now, let us say, I give you an αk. So, I  give you something and then claim that,
  • 00:17:01
    that satisfies this equation. So, let us say I give you some vector
  • 00:17:06
    u and then it satisfies this equation that  Ku=(nλk)u, let us say. Now, is this u unique?
  • 00:17:18
    No. So, now, what I can do is: K times (2u) will satisfy (nλk) times (2u).
  • 00:17:27
    Now, what does this tell us that every, I  mean, I can scale this u by any number and
  • 00:17:35
    then it will still satisfy this equation. So, which means that we need a specific way of
  • 00:17:40
    combining these data points. So, specific α that  will give us a wk. We know that the length of wk
  • 00:17:49
    is 1. So, we were looking for directions with  length 1. Now, which means that there should be
  • 00:17:55
    something that we can say about the length of αk  also or in general αk also. So, which means that
  • 00:18:01
    you cannot arbitrarily scale this u that satisfies  this equation and claim that all of these are
  • 00:18:08
    αk’s. So, now, what is that? What does that  tell us is what we are trying to find out?
  • 00:18:13
    Let us try to find out how the length  of wk gives us an indication of what
  • 00:18:19
    is the length that we should look for,  for αk. What do we know? So, we know
  • 00:18:30
    wk=Xαk. We saw that. So, we know that wk is some  combination of this which means w k transpose w k
  • 00:18:40
    which is the length is x α k transpose x α k.  Now, this is α k transpose x transpose x α k.
  • 00:18:53
    We know that we are looking for wk’s with the  length 1 which means the left-hand side is 1.
  • 00:18:58
    On the right-hand side, this x transpose x  shows up which is what we are calling as K.
  • 00:19:03
    Which means this is α k transpose k α k. So, now, we are saying that we need an αk that
  • 00:19:16
    satisfies this equation but it can not be  any αk, such an αk should also satisfy α k
  • 00:19:22
    transpose k α k equals 1. If you find an αk that  satisfies this equation and if that αk satisfies
  • 00:19:30
    α k transpose k α k equals 1 then that is the αk  that corresponds to wk which has length 1.
  • 00:19:40
    So, all this is algebra. All that we are saying is that we wanted wk, we are saying we can
  • 00:19:47
    equivalently solve for αk. And solving for αk  looks like solving for an Eigen equation in K.
  • 00:19:55
    But then the Eigenvectors length is unspecified.  So, we need to normalize the length and then
  • 00:20:02
    the fact that wk’s of length 1 implies that α k  transpose k α k should be 1. So, we should look
  • 00:20:09
    for αk’s Eigen equations but then you should also  have this property. So, this is first point.
  • 00:20:19
    So, which means, so, now,  we need one more important
  • 00:20:26
    theorem from linear algebra which will actually  be very very useful in understanding, in solving,
  • 00:20:33
    in completing this problem. And that  fact, I mean this is a linear algebra fact
  • 00:20:43
    or actually a theorem but I  will state it as a fact for now.
  • 00:20:52
    And we will see why this fact is useful.
  • 00:20:58
    So, essentially before I state this fact,  so, let me say why we need this fact.
  • 00:21:05
    Essentially, we wanted the Eigenvectors of  x x transpose but now we are saying we can
  • 00:21:12
    solve the Eigen equation of K, where K is just  x transpose x. Now, somehow this also involves
  • 00:21:21
    λk. We need to know λk which is the Eigen  value of X X transpose. But then we are only
  • 00:21:29
    solving the Eigen equation for X transpose X. So, now, is there a relation between the Eigen
  • 00:21:36
    values of XX transpose and X transpose X? So, because if we solve the Eigen equation for K
  • 00:21:43
    that will give you a set of Eigen vectors and  Eigen values, how are these Eigen values related
  • 00:21:48
    to the Eigen values that you would have gotten had you solved the Eigen equation for XX transpose.
  • 00:21:53
    That is the question we are asking. And the answer is this linear algebra fact.
  • 00:22:00
    The non-zero Eigenvalues of XX transpose  and X transpose X are exactly the same.
  • 00:22:22
    So, now, to make it precise: XX transpose is a matrix in d x d, and X transpose X is a
  • 00:22:26
    matrix in n x n. But because both of these come from the underlying X which
  • 00:22:37
    is d x n their Eigen values are related. In fact, if you are aware of the singular value
  • 00:22:43
    decomposition of matrices then you can simply use  that to prove this in two steps. I would not do
  • 00:22:49
    that. Feel free to try this. But what I am saying  is that the non-zero Eigenvalues, there might be
  • 00:22:56
    0 Eigen values because this is d x d, this is  n x n, so, the maximum number of non-zero Eigen
  • 00:23:01
    values of these matrices will be the minimum of  d and n. That is again a linear algebra fact.
  • 00:23:09
    So, there is going to be a lot of zero Eigen values if d is large, for which there would not
  • 00:23:15
    be corresponding Eigen values in the n x n matrix. So, if n is smaller than d and if your matrix has full
  • 00:23:21
    rank, then your number of non-zero Eigen values will be n and they will match with the
  • 00:23:28
    top n Eigen values of XX transpose which is a  d x d matrix. So, what does this essentially
  • 00:23:37
    tell us? So, what is all of this telling  us now? It is telling us the following.
  • 00:23:42
    And this is the most important thing. We will now try to put everything together
  • 00:23:48
    and see all of these together. So, we  initially had C which was 1/n XX transpose
  • 00:23:57
    whose Eigen vectors and Eigen values we  wanted. So, let us say the Eigenvectors of C
  • 00:24:05
    are w1 to wl, which we want. So, now, for all k from 1 to l we know that the norm of w k
  • 00:24:18
    is 1, the length is 1. Now, the Eigen values corresponding to this are
  • 00:24:27
    λ1, in fact, they are in decreasing order, λ1 > λ2 > ... > λl. So, this is what we wanted to get. Now,
  • 00:24:42
    X, let us look at XX transpose, XX transpose  is just nC. Now, what will be the Eigen vectors
  • 00:24:51
    of length 1 for XX transpose? They will exactly be again w1
  • 00:24:57
    to wl. It is just X, the matrix is just scaled. So, the Eigen values will scale accordingly
  • 00:25:02
    but the Eigenvectors of length 1 will stay the same. The Eigen values will be nλ1 > nλ2 > ... > nλl.
  • 00:25:14
    So, we need this w1 to wl somehow. Now, what are we saying? We are saying
  • 00:25:20
    we need to solve an Eigen equation of X  transpose X. This is the matrix which we
  • 00:25:29
    care about. Now, let us say the Eigenvectors of  X transpose X happen to be some vectors β1 to
  • 00:25:42
    βl, some Eigenvectors. Now, what do we know? We know that all the Eigenvectors' lengths are 1.
  • 00:25:50
    All of these guys are of length 1 and let us say,  now, what are the corresponding Eigen values.
  • 00:25:57
    Now, the fact, linear algebra fact that I  stated before says that XX transpose and X
  • 00:26:02
    transpose X have exactly the same  non-zero Eigenvalues. Which means
  • 00:26:06
    the Eigen values of X transpose X will also be nλ1 > nλ2 > ... > nλl.
  • 00:26:19
    So, this we have. So, now, what does  this mean? This means that this is K.
  • 00:26:27
    So, which means that Kβk=(nλk)βk. So, we  have found an Eigen vector βk which satisfies
  • 00:26:43
    Kβk=(nλk)βk. Now, question is what we wanted. We  wanted some αk’s which satisfies Kαk=(nλk)αk.
  • 00:26:56
    Now, here is something which satisfies,  we found a βk which satisfies Kβk=(nλk)βk.
  • 00:27:05
    But, is βk=αk? Can we solve the Eigen  equation like this and say is βk same as αk?
  • 00:27:14
    Well we saw that there is a length constraint  also on this αk’s. It is not just the fact that
  • 00:27:20
    it has to satisfy this equation, it also has  to satisfy α k transpose k α k must be 1.
  • 00:27:26
    So, which means, now, let us check with βk. So, now, what is beta k transpose K beta k,
  • 00:27:35
    what is this value? If βk was αk then this has  to be 1. But what is this? By the virtue of the
  • 00:27:42
    fact that βk is an Eigen vector of K, so, Kβk is (nλk)βk. So, this is beta k transpose n λ k beta k
  • 00:27:54
    which is n λ k beta k transpose beta k. But we know beta k is of length 1. So,
  • 00:28:02
    this is just nλk. So, βk Kβk is actually nλk.  But then we wanted, if we said βk is αk then
  • 00:28:13
    it is not going to work because then no longer  α k transpose k α k will be 1. So, there is a
  • 00:28:19
    scaling of nλk that is happening. So, which means we can set αk as
  • 00:28:26
    beta k divided by square root of n λ k. If you  do this, now, if once I set this for all k,
  • 00:28:42
    now, Kαk=(nλk)αk and, two, α k transpose K α k will also be 1. Why? Because α k transpose K α k is
  • 00:29:01
    simply beta k transpose K beta k divided by the square root multiplied twice, which is n λ k.
  • 00:29:09
    But we know the numerator is nλk that is what we  argued here, divided by nλk. This is just 1.
  • 00:29:17
    So, now, this is kind of telling us, how to  convert your Eigen solutions for this matrix
  • 00:29:26
    k into the αk that we really needed. We will  talk about why this is a useful thing to do. So,
  • 00:29:33
    we are kind of going in circles and trying  to find αk but then I will tell you why
  • 00:29:37
    this is a good thing to do in a minute. So, now, here is what is the algorithm that
  • 00:29:42
    we are proposing. So, our input is just the data set D, which is {x1, ..., xn}, where all
  • 00:29:52
    xi 's are in Rd and the assumption is  that d is much much larger than n. So,
  • 00:29:58
    this is the setup that we are in if we have too  many features. That is where the time complexity
  • 00:30:01
    is a problem. Now, what are we saying? We are saying that step 1, earlier, we would have
  • 00:30:09
    computed the covariance matrix and then we would  have computed the Eigenvectors and Eigenvalues.
  • 00:30:14
    We no longer want to do that because it is  expensive. So, what is the step 1? Step 1 compute
  • 00:30:21
    K= X transpose X. So, K is in Rn x n. Step  2. Compute Eigen decomposition of K.
  • 00:30:40
    Now, this is still an Eigen decomposition. Well  we wanted to avoid a costly Eigen decomposition
  • 00:30:48
    but then we are saying in step 2, we are doing an  Eigen decomposition of K. But note that K is an n
  • 00:30:54
    x n matrix. Which means the Eigen decomposition  of K is only going to take order of n cubed.
  • 00:31:00
    And if d is much much larger than n, essentially,  what you are saying is that instead of doing a d
  • 00:31:06
    cubed Eigen decomposition you can do an n cubed  Eigen decomposition. So, which might be much
  • 00:31:12
    cheaper, in general. So, we do Eigen decomposition of K and we get Eigen vectors β1 to βl
  • 00:31:22
    with corresponding Eigen values nλ1, ..., nλl.
  • 00:31:44
    Once you have done this, so, this is an order of n cubed computation.
  • 00:31:52
    So, once this is done, then we are almost done. So, step 3. We know that these βk’s are
  • 00:31:58
    not exactly αk’s but then you have the Eigen  values also with. So, what you can do is set α k
  • 00:32:12
    equals beta k divided by square root of n λ  k for all k equals 1 to l. That is it.
  • 00:32:26
    So, once you have this then you can get  back your w's if you want. So, wk=Xαk.
  • 00:32:40
    So, essentially, what we have done is we have gone a different route to find
  • 00:32:47
    our wk’s. Instead of directly solving  the Eigen decomposition, we are solving
  • 00:32:52
    a different matrix, a related matrix, the  Eigen decomposition of which is cheaper.
  • 00:32:56
    And then we are getting the Eigenvectors and  then converting that into the weights that you
  • 00:33:02
    need to combine the data points to get the  Eigen directions of the covariance matrix.
  • 00:33:07
    Now, this is pretty much solving problem  number 1, issue number 1. Because we are
  • 00:33:13
    not working with a huge d x d matrix but then  we are working with smaller n x n matrix.
  • 00:33:19
    Of course, this is helpful only if d is much  much larger than n. If n is larger than d,
  • 00:33:24
    you might as well do the covariance matrix decomposition. But then we are in that setup
  • 00:33:29
    where we are assuming that d is much much larger  than n and then we are trying to do this. So, this
  • 00:33:36
    solves issue 1, the time complexity issue. Now, note that you cannot, I mean you cannot get
  • 00:33:44
    away with the Eigen decomposition completely.  So, you have to do it. But then because there
  • 00:33:49
    are only two important parameters in this matrix,  one is d, one is n. We are just trying to make it
  • 00:33:55
    as simple as possible by picking the smaller of these. So, that is all issue 1.
  • 00:34:02
    Now, we want to talk about issue 2 in  the next video. But before going there,
  • 00:34:09
    I would want to make one observation which I  will again make in the next video but then I
  • 00:34:14
    want to hint it at this point and then we will  come back and connect it to the next video.
  • 00:34:19
    Now, what are we saying here,  we are given this data set,
  • 00:34:22
    the first step was to compute K as X transpose X. So, what is Kij in that case?
  • 00:34:31
    Kij is xi transpose xj. You can verify that this is what Kij would be, so, for all i, j.
  • 00:34:41
    In some sense, this is capturing the similarity  between xi and xj, in the dot product sense.
  • 00:34:51
    More importantly, this also tells  us that to solve the PCA problem,
  • 00:34:59
    you only need these pairwise dot  products between the data points.
  • 00:35:06
    If you have that you have all the information  that you need to compute the PCA things.
  • 00:35:13
    So, whatever you want in PCA. We will make  this more precise next time and then we will
  • 00:35:19
    see that this important observation of the fact  that you only need this kind of a similarity
  • 00:35:24
    between these data points and not necessarily  the data points themselves always plays a very
  • 00:35:31
    important role in trying to use the solution of  issue 1 to actually solve for issue 2 which is a
  • 00:35:40
    totally different question. It says that what  if the data points are not linearly related.
  • 00:35:46
    It needs some creative thought to take  this solution and then solve issue 2
  • 00:35:52
    and that is, in fact, one of the very  important ideas in classical machine learning
  • 00:35:58
    and we will talk about that in the next video. For now, I would want to summarize saying that all
  • 00:36:03
    we have seen today in this set of videos is how we have identified some issues with PCA
  • 00:36:12
    and then how you can solve the issue of time  complexity by converting your original problem
  • 00:36:19
    into a simpler problem in a computational sense  and solving the simpler problem will help you
  • 00:36:25
    solve the original problem. We will talk more  about issue 2 in the next video. Thank you.
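
Putting the three steps of the transcript's proposed algorithm together, here is a minimal end-to-end NumPy sketch (sizes, seed, and variable names are illustrative assumptions, not the course's code): form K = X^T X, eigen-decompose it, rescale its eigenvectors to get α_k = β_k / sqrt(n λ_k), and recover w_k = X α_k. The final checks confirm each recovered w_k is a unit-length eigenvector of the covariance matrix C = (1/n) X X^T.

```python
import numpy as np

# Inputs: x_1..x_n in R^d stacked as the columns of X, with d >> n.
# All sizes, the seed, and variable names here are illustrative assumptions.
d, n, top = 2000, 30, 5
rng = np.random.default_rng(4)
X = rng.standard_normal((d, n))

# Step 1: form the n x n matrix K = X^T X (never the d x d covariance).
K = X.T @ X

# Step 2: eigen-decompose K -- an O(n^3) job instead of O(d^3).
eigvals, betas = np.linalg.eigh(K)        # ascending order, unit-length eigenvectors
order = np.argsort(eigvals)[::-1][:top]
n_lam = eigvals[order]                    # these values are n * lambda_k
betas = betas[:, order]                   # beta_1 .. beta_top as columns

# Step 3: rescale to get the combination weights alpha_k = beta_k / sqrt(n lambda_k),
# then recover the covariance eigenvectors as w_k = X alpha_k.
alphas = betas / np.sqrt(n_lam)           # divides each column by its own factor
W = X @ alphas                            # d x top; columns are w_1 .. w_top

# Sanity check against the direct (expensive) route; C is built only for the check.
C = (X @ X.T) / n
for k in range(top):
    lam_k = n_lam[k] / n
    assert np.isclose(np.linalg.norm(W[:, k]), 1.0)      # unit length
    assert np.allclose(C @ W[:, k], lam_k * W[:, k])     # eigenvector of C with eigenvalue lambda_k
```

Note that the d x d covariance matrix appears here only inside the sanity check; the algorithm itself never eigen-decomposes anything larger than n x n.
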
Tags
  • PCA
  • Eigenvalues
  • Eigenvectors
  • Covariance Matrix
  • Time Complexity
  • Matrix Notation
  • Data Points
  • Dimensionality Reduction
  • Linear Algebra
  • Machine Learning