InfiniBand Principles Every HPC Expert MUST Know (Part 1)

01:16:53
https://www.youtube.com/watch?v=wecZb5lHkXk

Summary

TLDR: Oded Paz, from Mellanox, introduces the basics of InfiniBand technology and Mellanox Academy in this presentation. The focus is on educating HPC experts on the foundational elements of InfiniBand, including its architecture and components like Leaf and Spine switches, and the role of the Subnet Manager in network management. InfiniBand is highlighted for its high-speed, low-latency performance in data traffic movement, making it ideal for high-performance computing environments. The session also covers Mellanox Academy's offerings, which include a variety of training methods like instructor-led sessions, online courses, and certifications to enhance knowledge of Mellanox products and technologies. Key technologies discussed include remote direct memory access (RDMA) for reduced latency and CPU usage, as well as upcoming advancements like enhanced data rate (EDR) in InfiniBand links.

Takeaways

  • 👋 Oded Paz from Mellanox introduces himself and the session.
  • 📚 Mellanox Academy provides diverse training formats.
  • 🚀 InfiniBand excels in high-speed, low-latency data transport.
  • 🔗 Current FDR links run at 56 Gbps per port; EDR is expected to reach 100 Gbps.
  • 🔧 Subnet Manager automates network settings.
  • 🖥️ Leaf and Spine are key switch types in the architecture.
  • 🏎️ RDMA boosts performance by bypassing the kernel.
  • 🌌 InfiniBand networks are designed for scalability.
  • 🔑 GUIDs are unique identifiers in InfiniBand.
  • 🛡️ Partitioning offers security and QoS in networks.
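The link speeds quoted in the talk are signalling rates. As the speaker notes for QDR, 8b/10b encoding spends 20% of the line rate on overhead, so a 40 Gbit/s port carries 32 Gbit/s of data. A minimal sketch of that arithmetic follows; the function name `effective_data_rate` and the lookup table are illustrative assumptions, and only the SDR, DDR and QDR figures quoted in the talk are used.

```python
from fractions import Fraction

# Per-port signalling rates in Gbit/s, as quoted in the talk.
SIGNAL_RATE_GBPS = {"SDR": 10, "DDR": 20, "QDR": 40}


def effective_data_rate(signal_gbps, payload_bits=8, coded_bits=10):
    """Return the usable data rate after line-encoding overhead.

    With 8b/10b encoding, every 10 bits on the wire carry 8 bits of data.
    """
    return float(Fraction(payload_bits, coded_bits) * signal_gbps)


for generation, rate in SIGNAL_RATE_GBPS.items():
    data = effective_data_rate(rate)
    print(f"{generation}: {rate} Gbit/s signal, {data} Gbit/s data")
```

The encoding ratio is a parameter so the same helper can be reused for links that use a different line code.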

Timeline

  • 00:00:00 - 00:05:00

    The speaker introduces himself as a training professional from Mellanox, focusing on InfiniBand technology relevant to HPC experts. He highlights the offerings of Mellanox Academy, including various training services and resources.

  • 00:05:00 - 00:10:00

    The session aims to cover InfiniBand essentials for HPC experts, focusing on fabric architecture, network components, and Mellanox technologies. The speaker encourages questions and emphasizes practical application of concepts.

  • 00:10:00 - 00:15:00

    InfiniBand provides a high-performance solution for HPC environments, supporting speeds up to 56 Gbps and expected increases to 100 Gbps. Mellanox offers various generations of technology enhancing data rate and latency.

  • 00:15:00 - 00:20:00

    Mellanox provides host channel adapters and switches integral to InfiniBand technology, supporting flexible scalability and superior performance in HPC environments.

  • 00:20:00 - 00:25:00

    The speaker explains the InfiniBand generation history, detailing the evolution from 10 Gbps SDR to 56 Gbps FDR, and towards 100 Gbps EDR. Critical advancements include reduced latency and improved efficiency via RDMA.

  • 00:25:00 - 00:30:00

    InfiniBand was designed for high performance and reliability in network communications, offering low latency and high bandwidth. It's managed via software-defined networking, specifically by a subnet manager.

  • 00:30:00 - 00:35:00

    The subnet manager, a key entity in InfiniBand networks, facilitates configuration and redundancy. InfiniBand's use of RDMA minimizes CPU load and enhances data transfer efficiency, crucial for HPC tasks.

  • 00:35:00 - 00:40:00

    InfiniBand architecture features include host channel adapters, switches for subnet connectivity, and gateways for cross-protocol communication. It's primarily a layer 2 protocol facilitating efficient node communication.

  • 00:40:00 - 00:45:00

    Certain environments may require gateways to bridge InfiniBand with Ethernet or Fibre Channel, allowing cross-protocol data exchange typically managed through IP over IB standards.

  • 00:45:00 - 00:50:00

    Subnet managers automate configuration in InfiniBand networks as opposed to manual setups typical in Ethernet. Redundancy ensures network reliability through active and standby subnet managers.

  • 00:50:00 - 00:55:00

    Layer 2 addressing in InfiniBand is handled through local identifiers, managed by the subnet manager. These are subject to change on reboot unless configured otherwise for stability.

  • 00:55:00 - 01:00:00

    Fabric architecture in InfiniBand involves hierarchical topology with leaf and spine layers. Leaf switches connect directly to nodes, while spine switches enable cross-leaf communication for scalability.

  • 01:00:00 - 01:05:00

    InfiniBand supports semi-non-blocking environments through strategic port allocation, aiming to balance server-to-server bandwidth across the network fabric.

  • 01:05:00 - 01:10:00

    Partitioning in InfiniBand allows virtual segmentation within a single subnet, facilitating distinct operational environments or security boundaries for different applications or clients.

  • 01:10:00 - 01:16:53

    The speaker concludes the session, highlighting InfiniBand's ability to support varied network demands through adaptable architecture and partitioning. The session pauses for a break.
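The talk also explains that switch latency adds up per hop: a packet crossing several switches pays the per-switch delay each time. The sketch below shows that sum under the assumption of a uniform per-switch delay; the function name is hypothetical, and the 170 ns default is taken from the figure mentioned in the transcript for the FDR generation, not from a datasheet.

```python
def fabric_switch_latency_ns(hops, per_hop_ns=170):
    """Total latency added by switches along a path, in nanoseconds.

    Only the delay added by each switch is counted; host-side latency
    (application to channel adapter) is deliberately left out, as in the talk.
    """
    if hops < 0:
        raise ValueError("hops must be non-negative")
    return hops * per_hop_ns


# A leaf-spine-leaf path crosses three switches.
print(fabric_switch_latency_ns(3))
```

With three hops this gives 510 ns, in line with the "five to seven hundred nanoseconds" the speaker mentions.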

Video Q&A

  • What is Mellanox Academy?

    Mellanox Academy is an organization that provides training services, including instructor-led, web-based, and e-learning, to help individuals gain knowledge in InfiniBand and other Mellanox technologies.

  • What is InfiniBand used for?

    InfiniBand is used for high-performance computing environments to move data traffic with high speed, low latency, and increased bandwidth.

  • What speeds do current InfiniBand links support?

    Current InfiniBand links support speeds up to 56 gigabits per second, with plans to reach 100 gigabits per second.

  • What is RDMA?

    RDMA (Remote Direct Memory Access) is a technology that allows data to move directly between computers, bypassing the kernel and reducing latency and CPU load.

  • What is the main role of the Subnet Manager in InfiniBand?

    The Subnet Manager is responsible for managing the network configurations of InfiniBand, including assigning layer 2 addresses (LIDs) and ensuring network stability.

  • What is the difference between a Leaf switch and a Spine switch?

    A Leaf switch connects directly to servers, while a Spine switch connects multiple Leaf switches, facilitating inter-server communication.

  • What does non-blocking mean in the context of network switches?

    Non-blocking in network switches means the network can handle arbitrary heavy traffic without delays, ensuring that the bandwidth in the interconnection fabric matches the demands of the server connections.

  • How is the physical address identified in InfiniBand networks?

    In InfiniBand networks, the physical address is identified by a GUID (Globally Unique Identifier), which is a 64-bit address assigned to every device.

  • What is a Gateway in InfiniBand networks?

    A Gateway is a device used to convert and transfer data packets between different network protocols such as Ethernet and InfiniBand.

  • How do InfiniBand networks achieve software-defined networking (SDN)?

    InfiniBand networks achieve SDN through the Subnet Manager, which automates and manages all network configurations and parameters.
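The non-blocking answer above can be made concrete with a small calculation on a single leaf switch: compare the ports facing servers with the ports facing spine switches. This is a sketch under the assumption that every port runs at the same speed; the function names and the 36-port example are hypothetical and do not come from the talk.

```python
from fractions import Fraction


def oversubscription(downlinks, uplinks):
    """Ratio of server-facing ports to spine-facing ports on a leaf switch."""
    if downlinks <= 0 or uplinks <= 0:
        raise ValueError("port counts must be positive")
    return Fraction(downlinks, uplinks)


def is_non_blocking(downlinks, uplinks):
    """True when uplink capacity is at least the server-facing capacity."""
    return oversubscription(downlinks, uplinks) <= 1


# Hypothetical 36-port leaf: an even split is non-blocking,
# while 24 servers on 12 uplinks is 2:1 oversubscribed.
print(is_non_blocking(18, 18), oversubscription(24, 12))
```

A ratio above 1 corresponds to the "semi-non-blocking" designs mentioned in the timeline, where some server-to-server bandwidth is traded for more server ports.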

Subtitles
  • 00:00:03
    okay I'll begin first with the
  • 00:00:06
    introduction again my name is Oded Paz
  • 00:00:09
    I'm from Mellanox I'm working in
  • 00:00:12
    Mellanox for three and a half years
  • 00:00:14
    about and doing training I'm part of a
  • 00:00:18
    support division which means we are
  • 00:00:21
    involved in all the issues if you like
  • 00:00:23
    and technologies that Mellanox is
  • 00:00:26
    dealing with some of you might have met
  • 00:00:30
    me maybe last year on Twitter go well
  • 00:00:33
    last year and my main intention today is
  • 00:00:40
    actually to provide the HPC experts
  • 00:00:44
    because you are actually HPC experts to
  • 00:00:48
    the guys among you which are not
  • 00:00:51
    familiar with the basics of InfiniBand
  • 00:00:56
    and the most important issues in
  • 00:01:00
    InfiniBand skeleton if you like or
  • 00:01:03
    fabric I would like to make it clear to
  • 00:01:07
    most of us so this is my main intention
  • 00:01:11
    today I have additional small target and
  • 00:01:16
    the small target is to provide you an
  • 00:01:19
    information about Mellanox academy
  • 00:01:24
    Mellanox academy actually is a
  • 00:01:28
    organization that can help you get
  • 00:01:32
    different type of information we still
  • 00:01:35
    have problems with that yeah okay so we
  • 00:01:44
    would like to allow you to offer you
  • 00:01:47
    training services that could be online
  • 00:01:50
    service services that could be a
  • 00:01:54
    instructor-led training could be also
  • 00:01:58
    web training or okay that's better I
  • 00:02:00
    suppose and this is part of Mellanox
  • 00:02:03
    Academy so a lot of subjects that some
  • 00:02:09
    of you have holes of knowledge in them
  • 00:02:13
    are covered as part of Mellanox Academy
  • 00:02:16
    and as you can see as I was saying we
  • 00:02:20
    are providing in a instructor-led
  • 00:02:23
    training which is me and some other
  • 00:02:26
    people we have online training and we
  • 00:02:30
    also have certification processes in
  • 00:02:33
    order to allow people of your
  • 00:02:35
    organization to verify that they have
  • 00:02:40
    the right knowledge to provide support
  • 00:02:42
    or even just to to verify that they have
  • 00:02:46
    the basic of the InfiniBand technology
  • 00:02:49
    in order to get most of that so this is
  • 00:02:53
    actually the Mellanox Academy if you
  • 00:02:56
    will enter to our Mellanox Academy you
  • 00:02:58
    will see that we have the white papers
  • 00:03:01
    you have videos you have simulators and
  • 00:03:04
    you have other tools that can help you
  • 00:03:08
    get your theoretical knowledge as well as
  • 00:03:10
    practical knowledge if you are going
  • 00:03:13
    into configuration or installation
  • 00:03:17
    sessions so our sessions does not force
  • 00:03:21
    you to take again instruct
  • 00:03:25
    instructor-led training you can also do
  • 00:03:28
    a lot of jobs by taking the e-learning
  • 00:03:31
    sessions I'm not going to go into the
  • 00:03:37
    specific solutions that they are
  • 00:03:40
    provided actually I told you we what we
  • 00:03:44
    are providing and as I was saying we
  • 00:03:49
    have different types of learning methods
  • 00:03:54
    instructor web eLearning and those are
  • 00:04:00
    the main ones we have a list here of the
  • 00:04:03
    offering list of the training that you
  • 00:04:05
    can have so of course you don't have to
  • 00:04:08
    take it now but you need to be aware
  • 00:04:11
    that you have that option and as we were
  • 00:04:16
    saying we have flexible solution a
  • 00:04:19
    unique solution for all knowledge levels
  • 00:04:22
    and professional roles
  • 00:04:23
    in the InfiniBand environment and
  • 00:04:25
    InfiniBand and HPC as well
  • 00:04:29
    we have varied content of sessions and
  • 00:04:32
    courses then we have also flexible
  • 00:04:35
    payment options but between you and me
  • 00:04:38
    money is not a problem so suppose you
  • 00:04:42
    don't care about that of course we have
  • 00:04:45
    certified remote lab instructors and
  • 00:04:49
    courseware that you can use for that
  • 00:04:52
    same atom so we really urge you to use
  • 00:04:58
    Mellanox academy not for maybe your own
  • 00:05:01
    personal use but your environment and
  • 00:05:04
    your partners and this is about Mellanox
  • 00:05:09
    academy this is the site by the way of
  • 00:05:11
    the Mellanox academy maybe try to use it
  • 00:05:15
    later and see what are the options and
  • 00:05:18
    that was my introduction regarding
  • 00:05:21
    Mellanox academy and let's start with
  • 00:05:26
    the main target that we would like to
  • 00:05:30
    cover today so the session is called
  • 00:05:34
    InfiniBand essentials every
  • 00:05:37
    HPC expert must know so you are already
  • 00:05:40
    HPC experts so I suppose you should know
  • 00:05:45
    the following subjects some of you are
  • 00:05:49
    familiar with all of them some of you
  • 00:05:51
    are familiar with nothing some of you
  • 00:05:53
    are familiar with part of that let's try
  • 00:05:56
    to cover together again what we are
  • 00:05:59
    going to look at the IB principles we
  • 00:06:02
    are going to talk about the fabric
  • 00:06:04
    components the network components and
  • 00:06:06
    the fabric architecture we are going to
  • 00:06:10
    talk about the network or fabric
  • 00:06:14
    discovery mainly the jobs of the subnet
  • 00:06:19
    manager which is in charge of basically
  • 00:06:22
    everything in the network we are going
  • 00:06:26
    to talk about the main things the main
  • 00:06:30
    things that happens and the main
  • 00:06:32
    functions of each one of the InfiniBand
  • 00:06:37
    protocol layers and we'll talk a bit
  • 00:06:41
    about Mellanox products these are all
  • 00:06:44
    the things that I'm going to cover today
  • 00:06:46
    of course if you will have specific
  • 00:06:49
    questions or questions that will arise
  • 00:06:52
    during my sessions you're free to ask I
  • 00:06:56
    hope I would be able to provide answers
  • 00:06:59
    to all questions and in case I won't be
  • 00:07:04
    able I promise to provide you the right
  • 00:07:07
    answers by mail again my name is Oded
  • 00:07:12
    Oded Paz from Mellanox I'm not
  • 00:07:15
    going to ask you for all your names now
  • 00:07:18
    it's beyond 20 people so I'm not going
  • 00:07:20
    to do that but if you ask questions I
  • 00:07:24
    will be happy if you are able to present
  • 00:07:27
    yourself where you are coming from what
  • 00:07:30
    is your job in the organization which
  • 00:07:32
    organization etc thank you very much and
  • 00:07:35
    we will begin something to ask before I
  • 00:07:38
    begin no okay
  • 00:07:44
    ah so when we talk about the InfiniBand
  • 00:07:49
    a InfiniBand environment
  • 00:07:52
    InfiniBand fabric InfiniBand network are
  • 00:07:57
    going to provide and provide the best
  • 00:08:00
    solution today especially to move
  • 00:08:05
    traffic to move data in HPC environment
  • 00:08:10
    and the reasons are because it does
  • 00:08:13
    provide three or four features that are
  • 00:08:18
    very very important for the users for
  • 00:08:21
    the customers for the clients for the
  • 00:08:24
    applications of the HPC the
  • 00:08:27
    high-performance computing environment
  • 00:08:30
    if you look at the slide that we see
  • 00:08:34
    here then a network usually a fabric
  • 00:08:38
    will have its clients the clients will
  • 00:08:42
    be a servers that are compute servers or
  • 00:08:46
    storage servers that have to move
  • 00:08:48
    traffic between them
  • 00:08:50
    in order to provide the switching
  • 00:08:53
    between them while providing the
  • 00:08:55
    InfiniBand switches which are working
  • 00:08:57
    with the InfiniBand protocol the
  • 00:09:01
    connections between the compute nodes to
  • 00:09:05
    the switches or the storage nodes to the
  • 00:09:07
    switches are used with InfiniBand
  • 00:09:11
    links those InfiniBand links today are
  • 00:09:15
    going in pace in a speed of 56 gigabit
  • 00:09:20
    per second which is called FDR we're going
  • 00:09:24
    to talk about it later
  • 00:09:25
    and later this year the maximum rate per
  • 00:09:29
    port will reach 100 gigabit per second
  • 00:09:33
    in a speed rate that is called or standard
  • 00:09:37
    that each specific part is called the
  • 00:09:40
    EDR enhanced data rate so currently we
  • 00:09:44
    are in the generation that allows us
  • 00:09:45
    per port of server and per port of the
  • 00:09:49
    switch of up to 56 gigabit per second if
  • 00:09:55
    we talk specifically about Mellanox
  • 00:09:58
    technology Mellanox technology
  • 00:10:00
    generation today is providing the
  • 00:10:04
    channel adapter on the server side of
  • 00:10:07
    the network adapter what is called the
  • 00:10:09
    HCA host channel adapters in a
  • 00:10:12
    generation which is called ConnectX-3
  • 00:10:14
    and Connect-IB I'm going to talk about
  • 00:10:17
    the differences between them and the
  • 00:10:20
    switches generation that do provide the
  • 00:10:23
    FDR capability are called SwitchX-2
  • 00:10:27
    when you talk about the InfiniBand
  • 00:10:31
    environment and InfiniBand technology
  • 00:10:33
    you have specific integrated circuits
  • 00:10:38
    for that ASICs or chips if you like and
  • 00:10:41
    in our case Mellanox is providing and
  • 00:10:44
    producing and planning those chips which
  • 00:10:47
    are part of the channel adapters and
  • 00:10:49
    parts of the switches as we will say we
  • 00:10:53
    have the adapter cards that in the
  • 00:10:55
    InfiniBand are called the host channel
  • 00:10:58
    adapters of different types we have
  • 00:11:00
    different types because they provide
  • 00:11:04
    capabilities for different users and
  • 00:11:07
    customers according to to the need and
  • 00:11:10
    of course we have different type of
  • 00:11:12
    switches and the the most de or the main
  • 00:11:16
    reason for the difference between the
  • 00:11:18
    switches will be the capacity capacity
  • 00:11:20
    of ports capacity of servers that each
  • 00:11:24
    one of the switches may actually support
  • 00:11:26
    so we are talking about the switches
  • 00:11:28
    options or InfiniBand switches that are
  • 00:11:30
    going between range of 12 ports up to
  • 00:11:35
    648 ports for a specific switch and then
  • 00:11:40
    of course in order to to allow us to
  • 00:11:44
    enhance performance of customers to
  • 00:11:49
    enhance performance of applications we
  • 00:11:52
    also have a different type of software
  • 00:11:55
    solutions begins with management
  • 00:11:57
    solution like the UFM Unified Fabric
  • 00:12:00
    manager we are talking about the FCA
  • 00:12:03
    which is now part of the standard
  • 00:12:06
    actually part of Red Hat the FCA
  • 00:12:10
    accelerator or fabric collective
  • 00:12:13
    accelerations we have VSA for storage
  • 00:12:17
    virtual storage accelerator and other
  • 00:12:20
    options that will help you actually as
  • 00:12:24
    part of your solution that you provide
  • 00:12:27
    to your customers to provide better
  • 00:12:30
    latency better performance better usage
  • 00:12:34
    of the CPU and hardware resources of
  • 00:12:40
    course part of the part of the network
  • 00:12:43
    usually will be the media the media
  • 00:12:47
    between the servers to the switches the
  • 00:12:49
    media of the cables the cables are of
  • 00:12:51
    two main categories the copper cables
  • 00:12:54
    and the fiber cables this is the first
  • 00:12:56
    category beyond that category of copper
  • 00:13:00
    cables and fiber cables we have the
  • 00:13:02
    other category which is which talks
  • 00:13:05
    about is it a passive type of cable or
  • 00:13:08
    active type of cable just to make it
  • 00:13:13
    short and simple usually when we are
  • 00:13:15
    talking about the difference between
  • 00:13:18
    passive cable and active cable active
  • 00:13:21
    cable will allow you to have a longer
  • 00:13:23
    distance between two ends or two nodes
  • 00:13:26
    in the fabric so when we talk about the
  • 00:13:32
    targets targets of Mellanox solutions targets
  • 00:13:35
    of the InfiniBand technology today we
  • 00:13:38
    are talking about different type of
  • 00:13:40
    applications like financial services
  • 00:13:42
    cloud database enterprises Web 2.0 and
  • 00:13:46
    of course the most important is HPC it
  • 00:13:50
    would be different if I'm going to talk
  • 00:13:52
    in the cloud environment but for this
  • 00:13:54
    sessions the HPC this is the most
  • 00:13:57
    environment a most important one so we
  • 00:14:01
    are talking about actually advantage and
  • 00:14:05
    better performance enhanced performance
  • 00:14:07
    to the HPC environment when you are
  • 00:14:10
    working with
  • 00:14:12
    InfiniBand specifically and of course
  • 00:14:15
    with some of our Mellanox solutions that
  • 00:14:19
    will allow you to provide ten times
  • 00:14:23
    better performance higher usage or
  • 00:14:27
    better usage of your cpu resources
  • 00:14:30
    better latency and of course very high
  • 00:14:34
    bandwidth from the net for the server
  • 00:14:38
    side to the switches side in order to
  • 00:14:42
    provide it of course we need different
  • 00:14:45
    components in the network to be there and
  • 00:14:49
    this is actually the job of the
  • 00:14:52
    technology the job of the fabric that
  • 00:14:55
    will provide the service for you so what
  • 00:15:00
    we are going to see today is what are
  • 00:15:03
    the basics again of the technology what
  • 00:15:06
    are the main components what is their
  • 00:15:08
    job and how could we make it better as I
  • 00:15:13
    was saying before in order to provide
  • 00:15:17
    the main components in the network
  • 00:15:20
    Mellanox
  • 00:15:22
    has the technology for the channel
  • 00:15:25
    adapter which is today in a generation
  • 00:15:27
    which is called ConnectX-3
  • 00:15:31
    and in the switch's environment we are
  • 00:15:34
    talking about SwitchX-2 both of those
  • 00:15:38
    guys are able to provide today FDR 56
  • 00:15:43
    gigabit per second per port of the
  • 00:15:47
    channel adapter or per port of the switch
  • 00:15:54
    about ConnectX-3 as you can see
  • 00:15:57
    ConnectX-3 will allow you to connect on
  • 00:16:00
    the bus side on the server side to
  • 00:16:03
    connect with PCIe generation number 3
  • 00:16:08
    PCIe 3 will be supported by ConnectX-3
  • 00:16:13
    channel adapter cards any question in
  • 00:16:16
    this stage no okay so when we talk about
  • 00:16:24
    InfiniBand InfiniBand generation and
  • 00:16:27
    InfiniBand planning what what is the plan
  • 00:16:30
    ahead what is the history of the
  • 00:16:32
    InfiniBand so InfiniBand is the network
  • 00:16:35
    InfiniBand
  • 00:16:36
    as a pipe provider was planned in around
  • 00:16:41
    2000 and in around 2000 the port
  • 00:16:46
    capability in the InfiniBand environment
  • 00:16:48
    it was huge for that specific period we
  • 00:16:54
    could have provided 10 gigabit per
  • 00:16:57
    second per port and here was 2001 in
  • 00:17:04
    2005 we came to the second generation
  • 00:17:07
    that was called DDR double data rate
  • 00:17:11
    with the capability of 20 gigabit per
  • 00:17:14
    second per port the next generation that
  • 00:17:17
    I suppose part of you're still working
  • 00:17:21
    with was QDR QDR is able to
  • 00:17:26
    provide 40 gigabit per second per port
  • 00:17:29
    can you tell me who in the audience has
  • 00:17:32
    in his cluster QDR QDR so 70
  • 00:17:39
    percent is actually still working with
  • 00:17:41
    QDR so
  • 00:17:43
    can you also tell me what is the
  • 00:17:46
    capability in data that every QDR port
  • 00:17:50
    can provide can support how much data
  • 00:17:56
    can you forward on every QDR port you
  • 00:18:04
    are pointing that and this is you're
  • 00:18:06
    saying 40 gigabit what is your name yep
  • 00:18:12
    get really saying 40 gigabit do you
  • 00:18:15
    agree well he is saying 32 okay so uh well
  • 00:18:25
    QDR
  • 00:18:26
    provides 40 gigabit per second of signal
  • 00:18:30
    you may say so the signal the bits if
  • 00:18:33
    you will count them you will see 40
  • 00:18:34
    gigabit per second but actually we will
  • 00:18:38
    see that in the QDR generation we have
  • 00:18:40
    encoding which is called 8b/10b it
  • 00:18:43
    actually takes 20% of the line rate if
  • 00:18:47
    you like and provide it for for the
  • 00:18:50
    overhead encoding overhead so this is
  • 00:18:54
    the QDR generation some of you also have
  • 00:18:58
    ports channel adapters and switches of
  • 00:19:02
    the FDR generations who has FDR today so
  • 00:19:07
    we're coming to 10%
  • 00:19:10
    FDR 56 gigabit per second and this is
  • 00:19:14
    today and if we are going to talk about
  • 00:19:18
    the end of this year I don't want to
  • 00:19:20
    provide you specific dates then we would
  • 00:19:24
    be able to provide also 100 gigabit per
  • 00:19:28
    second per port and this is the new
  • 00:19:33
    generation or the current generation at
  • 00:19:35
    the end of 2014 which will be the EDR
  • 00:19:38
    generation enhanced data rate 100 gigabit
  • 00:19:43
    per second per port of your server 100
  • 00:19:46
    gigabit per second per port on the
  • 00:19:49
    switch
  • 00:19:51
    of course the other a element the other
  • 00:19:54
    parameter which is very important
  • 00:19:56
    specifically for high performance
  • 00:19:58
    compute will be the latency how much
  • 00:20:02
    time does it take for a packet for
  • 00:20:05
    calculation to go between one server to
  • 00:20:09
    the other server and this is actually
  • 00:20:12
    the latency the latency is the latency
  • 00:20:15
    when I'm talking here about the latency
  • 00:20:16
    what is the latency that I am as a
  • 00:20:19
    switch as InfiniBand switch I'm going to
  • 00:20:22
    add to the packet that comes into my
  • 00:20:26
    switch and then goes out I'm not going
  • 00:20:29
    to talk about the latency that is
  • 00:20:31
    actually happens if you like in the
  • 00:20:33
    server it happens in the process between
  • 00:20:37
    the application upper layers to the
  • 00:20:39
    lower layers I'm talking about the
  • 00:20:42
    latency that is added by the switch the
  • 00:20:46
    latency that is added by the switch
  • 00:20:48
    specifically we will talk about it it
  • 00:20:50
    may be again later when a packet comes
  • 00:20:54
    into InfiniBand switch of Mellanox in
  • 00:20:58
    our example until it goes out every
  • 00:21:01
    switch or every hop of a switch will
  • 00:21:06
    add 170 nanoseconds of course you can
  • 00:21:12
    tell me all that but usually we are
  • 00:21:14
    going when a packet enters to the fabric
  • 00:21:17
    many times it is actually passing more
  • 00:21:21
    than one switch more than one hop it may
  • 00:21:25
    go to the first switch and then to the
  • 00:21:28
    second switch and then to the third
  • 00:21:30
    switch so how do you count that of
  • 00:21:32
    course I'm going to add the delay so if
  • 00:21:37
    I have delay 170 nanoseconds in one switch
  • 00:21:41
    and then another one and then another
  • 00:21:43
    one
  • 00:21:43
    then i'm going to to count for five to
  • 00:21:47
    seven hundred nanoseconds of delay which
  • 00:21:49
    is still very very short and good delay
  • 00:21:53
    and this is in the current generation of
  • 00:21:56
    FDR
  • 00:21:59
    but what was the beginning of the
  • 00:22:02
    InfiniBand the beginning of the
  • 00:22:04
    InfiniBand 1999 there is organization
  • 00:22:08
    which is called was set up organization
  • 00:22:12
    that was called the IBTA InfiniBand
  • 00:22:15
    Trade Association founded in 1999 and the
  • 00:22:19
    target of the IBTA was to provide the
  • 00:22:22
    network and the protocol for data so
  • 00:22:27
    this is not something which is new we
  • 00:22:31
    already had networks like X.25 and
  • 00:22:35
    ISDN and Ethernet and Fibre Channel and
  • 00:22:39
    frame relay
  • 00:22:41
    so all those network are actually also
  • 00:22:45
    providing data and networks so what is
  • 00:22:49
    the difference what is the main
  • 00:22:52
    difference or what is one of the main
  • 00:22:54
    differences between date data network of
  • 00:22:57
    InfiniBand to the other data network so
  • 00:23:00
    first it was created it was thought
  • 00:23:02
    about in 2000
  • 00:23:04
    so in 2000 they were thinking already
  • 00:23:07
    about customers of the 2000s
  • 00:23:10
    applications of the 2000s performance capacity
  • 00:23:15
    latency application HPCs that cannot
  • 00:23:21
    work with the other protocols because
  • 00:23:24
    they are too slow because they are too
  • 00:23:27
    narrow and because also because of
  • 00:23:32
    another a very important factor one of
  • 00:23:36
    the things that was not thought about in
  • 00:23:38
    the other network like Ethernet
  • 00:23:41
    switching they were not thinking about
  • 00:23:45
    what happens to a packet of data to a
  • 00:23:49
    message of data in the server itself
  • 00:23:54
    when you are taking a packet of data in
  • 00:23:58
    the tcp/ip environment that packet of
  • 00:24:01
    data is taken and then it is put into
  • 00:24:05
    frames in the higher layers and then it
  • 00:24:09
    goes in the server to the kernel
  • 00:24:12
    environment to the TCP/IP kernel
  • 00:24:15
    environment and what happens there that
  • 00:24:19
    the kernel environment has to process
  • 00:24:21
    each one of those packets each one of
  • 00:24:26
    the data packets has to be processed in
  • 00:24:28
    the kernel environment and then to be
  • 00:24:30
    sent to be copied to another buffer and
  • 00:24:33
    then from the buffer to be sent to the
  • 00:24:35
    channel adapter and in the channel
  • 00:24:38
    adapter to send it out that is normally
  • 00:24:42
    what happens in the other type of
  • 00:24:44
    protocols one of the things that was
  • 00:24:47
    thought about in the InfiniBand was how
  • 00:24:52
    to reduce the time that the packet will
  • 00:24:56
    take from the application layer until it
  • 00:25:01
    goes out from the server to the other
  • 00:25:04
    side that time has to be shortened that
  • 00:25:07
    time has to be cut and this is something
  • 00:25:11
    which is very unique to the InfiniBand
  • 00:25:15
    protocol and InfiniBand environment how
  • 00:25:18
    do you make it more efficient how do you
  • 00:25:22
    make it more short how do you make the
  • 00:25:26
    latency better within this distance
  • 00:25:29
    between the higher layer protocols to
  • 00:25:33
    the channel adapter before it goes out
  • 00:25:35
    to the other side so this is also
  • 00:25:37
    something that was a part of the
  • 00:25:40
    thoughts of the guys that were behind
  • 00:25:43
    the creation of the InfiniBand behind the
  • 00:25:48
    planning of the InfiniBand like IBM
  • 00:25:51
    Mellanox Oracle HP Cray Intel and others
  • 00:25:54
    that probably i did not count here so
  • 00:25:59
    InfiniBand is actually a switched
  • 00:26:02
    fabric architected architecture
  • 00:26:04
    interconnected technology connecting
  • 00:26:06
    cpus and iOS which provides super high
  • 00:26:12
    performance which provides high
  • 00:26:14
    bandwidth okay starting at 10 gigabit
  • 00:26:18
    per second up to 100
  • 00:26:22
    gigabit per second per
  • 00:26:24
    port per port of a server per port of the
  • 00:26:28
    switch the low-latency that is required
  • 00:26:32
    by your customers should be challenged
  • 00:26:35
    by the protocol and should be responded
  • 00:26:39
    by the provided protocol and as I was
  • 00:26:41
    telling you the minimal latency that
  • 00:26:45
    today's switch will have is 117 nano
  • 00:26:50
    second and the other challenge that I
  • 00:26:54
    was talking about how do we cut how do
  • 00:26:59
    we enhance the performance of the
  • 00:27:01
    application and the way that it sends
  • 00:27:03
    the information from the higher layer
  • 00:27:05
    protocols until it goes to the channel
  • 00:27:08
    adapter and the main mechanism which is
  • 00:27:12
    used for that and I suppose all of you
  • 00:27:16
    probably familiar with is using the
  • 00:27:21
    process using the mechanism which is
  • 00:27:24
    called RDMA remote direct memory
  • 00:27:30
    access which will provide a better
  • 00:27:34
    latency for every packet that goes from
  • 00:27:36
    the upper layer protocol up to the
  • 00:27:40
    channel adapter and will also allow us
  • 00:27:43
    to have a lower CPU or a better CPU
  • 00:27:49
    utilization better CPU utilization
  • 00:27:51
    because most of the traffic that was
  • 00:27:55
    processed until the usage of
  • 00:27:58
    RDMA that was processed in the kernel now
  • 00:28:03
    should not be processed in the kernel
  • 00:28:06
    this is one of the advantages of
  • 00:28:10
    RDMA so traffic communication now in the
  • 00:28:15
    technology that we are planning the
  • 00:28:18
    traffic communication will bypass the
  • 00:28:20
    operating system will bypass the kernel
  • 00:28:24
    and it means that the CPU could be used
  • 00:28:29
    for many more processes which are
  • 00:28:33
    other than
  • 00:28:35
    dealing with the traffic itself
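As a rough illustration of the difference between a copied buffer and a shared one, the following Python sketch uses `bytes(...)` to stand in for the copy into a kernel buffer and `memoryview` to stand in for direct access to the user buffer. It is only an analogy on ordinary process memory and assumes nothing about real RDMA interfaces; the names `user_buf`, `kernel_copy` and `direct_view` are made up for the example.

```python
# Analogy only: plain Python buffers, not real RDMA.
user_buf = bytearray(b"result from the application")

# Kernel-style path: the payload is duplicated into a second buffer.
kernel_copy = bytes(user_buf)

# Zero-copy style path: a view onto the same memory, nothing is duplicated.
direct_view = memoryview(user_buf)

# Change the first byte of the user buffer after both were created.
user_buf[0] = ord("R")

# The copy still holds the old byte, the view sees the new one.
copy_sees_change = kernel_copy[0:1] == b"R"
view_sees_change = bytes(direct_view[0:1]) == b"R"
print(copy_sees_change, view_sees_change)
```

The copy costs time and memory for every packet, which is the work the talk says RDMA avoids by taking the kernel out of the data path.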
  • 00:28:42
    InfiniBand was originally designed for
  • 00:28:44
    large scale grids and clusters it came to
  • 00:28:49
    increase the application performance it
  • 00:28:51
    can provide solutions for local area
  • 00:28:54
    network for a bigger network for
  • 00:28:58
    storage area networks and application
  • 00:29:00
    communications it will provide high
  • 00:29:04
    reliability cluster management what does
  • 00:29:08
    it mean high reliability cluster
  • 00:29:10
    management well this is a bit vague I
  • 00:29:14
    must say but the high reliability of
  • 00:29:17
    course means that if you have something
  • 00:29:19
    or an entity that will manage the
  • 00:29:22
    network and that entity fails down
  • 00:29:25
    automatically you will have a backup to
  • 00:29:28
    that entity that will take its place we
  • 00:29:32
    are talking about the first Network
  • 00:29:34
    which is an SDN network a real SDN network
  • 00:29:40
    which means software-defined network
  • 00:29:43
    software managed Network which means
  • 00:29:47
    that we don't need someone we don't need
  • 00:29:50
    an administrator that will configure all
  • 00:29:54
    parameters all the parameters in your
  • 00:29:59
    network all the parameters up to a level
  • 00:30:01
    of a port can be defined today by the
  • 00:30:05
    entity that manages the network by
  • 00:30:08
    software and that entity that manages
  • 00:30:11
    the InfiniBand network by software is
  • 00:30:14
    called the subnet manager we're going of
  • 00:30:18
    course to talk about it before we
  • 00:30:21
    continue
  • 00:30:21
    since RDMA is one of the main
  • 00:30:25
    processes or one of the main solutions
  • 00:30:27
    here
  • 00:30:28
    what is the advantage of RDMA RDMA
  • 00:30:31
    again for those well somebody here does
  • 00:30:34
    not know what RDMA is then please
  • 00:30:37
    raise your hand okay
  • 00:30:42
    so don't laugh if you have that question
  • 00:30:45
    it seems funny there are people in the
  • 00:30:48
    world which are not familiar with
  • 00:30:50
    RDMA okay so since everybody knows what
  • 00:30:55
    RDMA is and I suppose that you know
  • 00:30:57
    that without RDMA we have the
  • 00:31:00
    application message that goes on the
  • 00:31:03
    buffers on the user layer and from the
  • 00:31:06
    buffers in the user layer they have to
  • 00:31:08
    be copied every packet has to be copied
  • 00:31:12
    to the buffers in the operating system
  • 00:31:14
    level and from the operating system
  • 00:31:17
    level after it is processed in the
  • 00:31:20
    tcp/ip layers it has to be copied again
  • 00:31:24
    to the channel adapter layer the channel
  • 00:31:28
    adapter or the network card as it is
  • 00:31:31
    called here or as it is called in the
  • 00:31:33
    InfiniBand environment host channel
  • 00:31:37
    adapter so every packet has to go user
  • 00:31:40
    copy kernel copy a kernel process
  • 00:31:43
    kernel copy to the channel adapter and
  • 00:31:46
    then goes to the other side this is
  • 00:31:50
    what happens without RDMA with RDMA
  • 00:31:53
    we have a zero copy we have a zero copy
  • 00:31:57
    we don't have to copy the packets which
  • 00:32:00
    are the data packets we don't have to
  • 00:32:02
    copy them to the kernel we don't have to
  • 00:32:05
    copy them to the kernel so we saved it
  • 00:32:07
    we save the time we save the copy
  • 00:32:10
    process and we just take those packets
  • 00:32:14
    directly from the user layer of one
  • 00:32:18
    server we send it directly to the user
  • 00:32:22
    layer of the other server and this is
  • 00:32:25
    the RDMA over InfiniBand
  • 00:32:27
    that saves time saves latency enhances
  • 00:32:32
    the latency and of course allows us to
  • 00:32:35
    do much more work in the operating
  • 00:32:38
    system actually we can probably
  • 00:32:41
    gain between 60 to 80 percent of
  • 00:32:46
    improvement or efficiency if you like
  • 00:32:49
    in the cpu
  • 00:32:53
    because we don't need to process the
  • 00:32:55
    traffic in the operating system in the
  • 00:32:59
    kernel so let's look at the InfiniBand
  • 00:33:03
    architecture InfiniBand architecture
  • 00:33:06
    this is the basic structure like most of
  • 00:33:11
    the other data network that you know do
  • 00:33:14
    you have the laser pointer are working
  • 00:33:16
    in your yeah probably
  • 00:33:30
    okay data networks look
  • 00:33:37
    actually very similar if you'll take
  • 00:33:39
    Ethernet frame relay ISDN X 25 whatever
  • 00:33:44
    data network it will look like that
  • 00:33:47
    right so we have to talk about the
  • 00:33:51
    network itself and what are the parts
  • 00:33:53
    which are special in those network so
  • 00:33:56
    first let's talk about our customers our
  • 00:33:59
    customers are your customers as
  • 00:34:02
    well those are the HPC computing systems
  • 00:34:05
    or computing servers and the HPC storage
  • 00:34:09
    servers so those are my customer here I
  • 00:34:12
    have the application let's say that one
  • 00:34:15
    of the most common applications in your
  • 00:34:17
    environment will be an Open MPI protocol
  • 00:34:21
    will be GPFS nvidia or some other type
  • 00:34:27
    of applications which are working here
  • 00:34:30
    on your clients then we are taking that
  • 00:34:33
    application and that application
  • 00:34:35
    calculation result has to go to the other
  • 00:34:39
    computation server in order to move
  • 00:34:42
    the traffic to the other side we have to
  • 00:34:45
    take that packet and then we have to go
  • 00:34:49
    out to the other server how do we go to
  • 00:34:51
    the other server for that purpose we
  • 00:34:55
    need switches the switches are doing the
  • 00:34:57
    switching so we are taking packets from
  • 00:35:00
    server number X to server number Y later
  • 00:35:04
    we are going to talk about addresses and
  • 00:35:06
    addressing so we are going to take the
  • 00:35:09
    packet and send it via the network card
  • 00:35:12
    of the InfiniBand environment the
  • 00:35:15
    network card of the InfiniBand
  • 00:35:17
    environment is called the HCA the host
  • 00:35:20
    channel adapter the host channel
  • 00:35:23
    adapter has actually two sides one side
  • 00:35:26
    is the side of the bus of the first
  • 00:35:29
    server which is the PCI bus and the
  • 00:35:32
    other side is the InfiniBand link the
  • 00:35:35
    InfiniBand link could go with the
  • 00:35:38
    relevant media to the switch what is the
  • 00:35:42
    relevant
  • 00:35:42
    media the relevant media could be a cable
  • 00:35:45
    if you would like to have higher
  • 00:35:47
    distance instead of copper you are taking
  • 00:35:50
    fiber right okay so you took the
  • 00:35:54
    InfiniBand link up to the switch that
  • 00:35:57
    switch may have different types of
  • 00:35:59
    ports in terms of capacity it could be a
  • 00:36:02
    switch that takes 12 ports it could be a
  • 00:36:05
    switch that takes 648 ports in order to
  • 00:36:10
    move the traffic between that server
  • 00:36:12
    server A to that server B I have to go
  • 00:36:16
    to at least one switch and sometimes to
  • 00:36:20
    several switches so therefore we are
  • 00:36:22
    going into the network between
  • 00:36:25
    one switch to another switch and then we
  • 00:36:29
    send information to the end destination
  • 00:36:33
    the data network or the fabric the
  • 00:36:38
    InfiniBand fabric here is working today
  • 00:36:43
    in a subnet or one subnet only the data
  • 00:36:49
    network today is working as a layer 2
  • 00:36:54
    data network what is the difference
  • 00:36:58
    between layer 2 data network to layer 3
  • 00:37:01
    data network
  • 00:37:08
    now this is a question that I know that
  • 00:37:10
    maybe some of you do not know this is
  • 00:37:13
    why I'm asking unlike RDMA yes IP
  • 00:37:20
    packet okay well of course you're right
  • 00:37:29
    about in part of the answer somebody
  • 00:37:33
    would like to expand No
  • 00:37:36
    okay so when I'm talking about one
  • 00:37:39
    subnet about switching of layer 2
  • 00:37:42
    switching layer 2 means that all of my
  • 00:37:46
    nodes so the difference is wrong ok so
  • 00:37:51
    all of my nodes belong to the same
  • 00:37:53
    subnet right now IP is a term
  • 00:38:00
    from another environment ok because in
  • 00:38:05
    the InfiniBand we have different types
  • 00:38:07
    of addressing okay so one subnet means
  • 00:38:11
    that all my nodes belong to the same
  • 00:38:14
    subnet to the same subnet therefore if
  • 00:38:17
    you have 20 nodes or two thousand nodes
  • 00:38:21
    all of them are part of the same subnet
  • 00:38:25
    and the switches currently here are
  • 00:38:29
    switches of layer 2 so it doesn't matter
  • 00:38:34
    of which organization here you are
  • 00:38:36
    coming today you're working with layer 2
  • 00:38:41
    network in the InfiniBand environment so
  • 00:38:45
    those are our switches about
  • 00:38:48
    addressing we're going to talk in the next
  • 00:38:50
    stage so in most of your environments
  • 00:38:55
    you have the InfiniBand environment for
  • 00:38:58
    the HPC for calculation for GPFS or
  • 00:39:02
    nvidia for storage for computing and
  • 00:39:04
    storage and most of it is InfiniBand in
  • 00:39:08
    some of your network you have kind of
  • 00:39:10
    mixed environment and mixed environment
  • 00:39:14
    environment means
  • 00:39:16
    that maybe maybe two years from today or
  • 00:39:21
    twins are back
  • 00:39:23
    you had another network that is maybe
  • 00:39:25
    Ethernet network or fiber channel
  • 00:39:29
    network and sometimes you need to move
  • 00:39:33
    information between the part which is
  • 00:39:36
    the InfiniBand part
  • 00:39:38
    InfiniBand environment to the part which
  • 00:39:41
    is the ethernet environment so you have
  • 00:39:44
    two computers one computer one node is
  • 00:39:48
    in the ethernet environment and one node
  • 00:39:51
    is in the InfiniBand environment how do
  • 00:39:55
    you move traffic between two different
  • 00:39:57
    environments the answer is how using
  • 00:40:05
    routing no the answer is thank you by
  • 00:40:12
    the way for the response the answer here
  • 00:40:15
    is a gateway it's not very far of course
  • 00:40:20
    from the answer the answer is a gateway
  • 00:40:23
    what the Gateway is a gateway is
  • 00:40:25
    actually if you like a converter
  • 00:40:28
    a converter between two protocols a
  • 00:40:31
    converter between two languages so what
  • 00:40:35
    I would like to do here in my example I
  • 00:40:37
    would like to take a packet that is on a
  • 00:40:40
    server here in that environment to a
  • 00:40:43
    packet to a server in that environment
  • 00:40:46
    because here they talk InfiniBand ish
  • 00:40:51
    you know that language and here they
  • 00:40:55
    talk Ethernet we have actually to take
  • 00:40:58
    packets from the IB to Ethernet and from
  • 00:41:02
    the Ethernet to IB for that purpose we
  • 00:41:05
    have the gateway this is the job of the
  • 00:41:08
    gateway to enable us to talk between
  • 00:41:11
    Ethernet to InfiniBand or Fibre Channel to
  • 00:41:14
    InfiniBand the component is called
  • 00:41:16
    gateway good clear
  • 00:41:22
    usually the information or usually the
  • 00:41:25
    type of packets of course that will go
  • 00:41:27
    between the ethernet environment to the
  • 00:41:30
    InfiniBand environment will be IP
  • 00:41:33
    packets we are using in the IB we are
  • 00:41:38
    going to talk about it we are using a
  • 00:41:41
    function which is called IP over IB IP
  • 00:41:48
    over IB that will be the way we are
  • 00:41:52
    going to take packets IP packets here
  • 00:41:54
    and send them to IP packets on the
  • 00:41:57
    Ethernet environment IP over IB now
  • 00:42:01
    let's talk about the management entity
  • 00:42:04
    one of the things that will make the
  • 00:42:07
    InfiniBand environment special and
  • 00:42:09
    unique is the capability of the
  • 00:42:12
    InfiniBand to be SDN software-defined
  • 00:42:16
    network and software-defined network
  • 00:42:19
    somebody here has managed Ethernet
  • 00:42:21
    environment working with switches of
  • 00:42:27
    Ethernet whoo you're saying you're
  • 00:42:32
    nodding like Gabriel so if you would
  • 00:42:35
    like for example to change parameter in
  • 00:42:40
    on a VLAN or trunk or speed what do
  • 00:42:44
    you have to do you have to do it
  • 00:42:45
    yourself so it's kind of do-it-yourself
  • 00:42:47
    network but you have to do it yourself
  • 00:42:50
    so you need an administrator in Ethernet
  • 00:42:54
    environment in InfiniBand environment
  • 00:42:57
    actually as I was saying it is actually
  • 00:42:59
    the first network type that is SDN
  • 00:43:03
    software-defined network so who is that
  • 00:43:07
    entity the entity in the InfiniBand
  • 00:43:10
    protocol that entity is called subnet
  • 00:43:14
    manager sm the subnet manager is the
  • 00:43:18
    most important entity in the network
  • 00:43:21
    because it does all the configuration up
  • 00:43:28
    to the level of a port of each one of the
  • 00:43:31
    switches of each one of the servers
  • 00:43:35
    every parameter is managed by the subnet
  • 00:43:39
    manager this is the configuration you
  • 00:43:44
    don't need any administrator to do any
  • 00:43:47
    changes here the subnet manager will do
  • 00:43:49
    it all it does work we have we have an
  • 00:44:05
    unsatisfied customer
  • 00:44:08
    so the way to do that today in marketing
  • 00:44:13
    they are saying you're right Gabriel
  • 00:44:14
    what is the problem we're going to solve
  • 00:44:17
    it it's not that I'm dissatisfied is
  • 00:44:20
    just it's not just switching ok
  • 00:44:25
    first let's talk about it ok but we will
  • 00:44:29
    do it in one of the next stages ok so
  • 00:44:36
    this is the subnet managers that are
  • 00:44:38
    going according to the theory in most of
  • 00:44:41
    the time should work and the subnet
  • 00:44:44
    manager is doing everything what happens
  • 00:44:49
    if that subnet manager fails if the
  • 00:44:56
    subnet manager fails let's call it the
  • 00:44:58
    active subnet manager if the
  • 00:45:01
    active subnet manager fails of course we
  • 00:45:03
    need another subnet manager in the
  • 00:45:07
    network the other subnet manager that
  • 00:45:10
    will be somewhere here will be the
  • 00:45:12
    standby or the fallback subnet manager
  • 00:45:17
    so we have in every network we must have
  • 00:45:20
    a fallback and an active subnet manager
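The active and standby roles described here can be sketched in a few lines of Python. This is a simplified model, assuming only that a standby takes over when the active subnet manager stops responding; it does not reproduce how a real fabric elects its manager, and the class name, field names and manager names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class SubnetManager:
    """Hypothetical stand-in for one subnet manager instance."""
    name: str
    alive: bool = True


def acting_manager(managers):
    """Return the first manager in the list that is still alive, or None."""
    for sm in managers:
        if sm.alive:
            return sm
    return None  # nobody is left to configure the fabric


active = SubnetManager("sm-active")
standby = SubnetManager("sm-standby")
fabric_sms = [active, standby]

before = acting_manager(fabric_sms).name
active.alive = False  # the active subnet manager fails
after = acting_manager(fabric_sms).name
print(before, "->", after)
```

With only one entry in `fabric_sms`, the same failure would leave `acting_manager` returning `None`, which is the situation the fallback manager exists to prevent.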
  • 00:45:24
    by the way the protocol itself or the
  • 00:45:28
    standard itself of the InfiniBand allows
  • 00:45:31
    you to have more than two subnet
  • 00:45:33
    managers we suggest to have only two
  • 00:45:36
    subnet managers in the network so we
  • 00:45:41
    were talking about currently about a
  • 00:45:43
    layer 2 network although it will support
  • 00:45:46
    in the future in layers
  • 00:45:48
    the main components are the customers
  • 00:45:51
    themselves the nodes the channel adapter that
  • 00:45:54
    will allow you to do the network
  • 00:45:55
    connection
  • 00:45:56
    the InfiniBand link itself that goes
  • 00:45:59
    from the server to the switch the
  • 00:46:01
    switches that currently are doing layer
  • 00:46:03
    2 switching the nodes belong all of them
  • 00:46:07
    to the same network or to the same
  • 00:46:09
    subnet and in order to move packets
  • 00:46:13
    between InfiniBand environment to other
  • 00:46:16
    type of protocol environment we have a
  • 00:46:19
    functionality which is called gateway
  • 00:46:22
    questions ok so we talked about the
  • 00:46:29
    different sorry the different elements
  • 00:46:34
    and first the host channel adapter the
  • 00:46:40
    host channel adapter device that
  • 00:46:43
    terminates the InfiniBand link here we
  • 00:46:46
    have the connector this is the host
  • 00:46:48
    channel adapter it looks like I suppose
  • 00:46:52
    other channel adapters that you have
  • 00:46:54
    seen up today but the connector is
  • 00:46:57
    different the connector today in the
  • 00:46:59
    InfiniBand is called QSFP connector for
  • 00:47:05
    those of you I suppose
  • 00:47:08
    small number who are not familiar with
  • 00:47:10
    that then QSFP is a different connector
  • 00:47:13
    it is not SFP it's not SFP plus it's
  • 00:47:17
    not RJ45 it is QSFP so we have a QSFP
  • 00:47:23
    connector that goes out towards
  • 00:47:27
    the switch this is the channel adapter
  • 00:47:28
    of course it's part of the server and
  • 00:47:31
    this is the side that goes to the bus of the
  • 00:47:34
    server the PCIe bus the PCI generation
  • 00:47:38
    today that we are talking about is PCI
  • 00:47:41
    generation 3 then we have the switches
  • 00:47:45
    the switch is a device that moves
  • 00:47:47
    packets from one link to another or from
  • 00:47:50
    one node to another on the same
  • 00:47:53
    InfiniBand subnet on the same InfiniBand
  • 00:47:56
    subnet this is why we call it layer 2
  • 00:48:01
    switching okay then we have an element
  • 00:48:06
    which is called the router and now you
  • 00:48:10
    can ask me Ed but you told us that
  • 00:48:12
    we are working in layer 2 network so why
  • 00:48:17
    do you need a router the answer is that
  • 00:48:24
    currently practically we are not working
  • 00:48:27
    with routers currently we have router in
  • 00:48:31
    the protocol we have router in the
  • 00:48:34
    standard we have layer three for routing
  • 00:48:37
    of InfiniBand we have addresses for
  • 00:48:41
    routing of InfiniBand routing will allow
  • 00:48:44
    us to move traffic between one subnet to
  • 00:48:49
    another subnet so we have it in the
  • 00:48:53
    standard we have it in the theory
  • 00:48:56
    currently it's not yet implemented
  • 00:49:01
    almost 100% and then we have the gateway
  • 00:49:05
    or a bridge as it is called and the
  • 00:49:08
    gateway job will be to enable us
  • 00:49:10
    to move packets between InfiniBand
  • 00:49:15
    environment to Ethernet environment or to
  • 00:49:19
    fibre channel environment we talked
  • 00:49:23
    about the channel adapter and we said
  • 00:49:26
    that the channel adapter is connected
  • 00:49:28
    with a link to the switch the link could
  • 00:49:32
    work with different type of speeds it
  • 00:49:36
    could be SDR or DDR
  • 00:49:39
    or QDR or FDR at the end of the year it
  • 00:49:43
    will be also EDR this is the way if
  • 00:49:47
    you like so if we are going back to 2001
  • 00:49:51
    the only capability was SDR single data
  • 00:49:56
    rate then we went to DDR double
  • 00:50:01
    data rate and then we have QDR
  • 00:50:05
    quadruple data rate and then we went
  • 00:50:08
    into FDR fourteen data rate this is the current
  • 00:50:11
    generation and at the end of the
  • 00:50:14
    year God willing we're going to have the
  • 00:50:17
    EDR enhanced data rate which is I think it's
  • 00:50:20
    actually a what 25 words alien not 25
  • 00:50:28
    sorry it's a mistake well please edit it
  • 00:50:36
    in the deck yeah well actually it was
  • 00:50:43
    put there intentionally to check if you
  • 00:50:46
    are really checking what is said here
  • 00:50:50
    right
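To put numbers on the generations just listed, here is a small Python sketch that computes the signalling rate and the usable data rate of a link. The per-lane rates, the 8b/10b and 64b/66b line encodings and the 4-lane (4x) link width are not stated in the talk; they are filled in from commonly published InfiniBand figures and should be read as assumptions.

```python
# (name, per-lane signalling rate in Gb/s, payload bits, total bits per code word)
# Values are assumptions taken from commonly published figures, not from the talk.
GENERATIONS = [
    ("SDR", 2.5, 8, 10),
    ("DDR", 5.0, 8, 10),
    ("QDR", 10.0, 8, 10),
    ("FDR", 14.0625, 64, 66),
    ("EDR", 25.78125, 64, 66),
]
LANES = 4  # assumed 4x link width


def link_rates(lane_gbps, payload_bits, total_bits, lanes=LANES):
    """Return (signalling rate, usable data rate) in Gb/s for one link."""
    raw = lane_gbps * lanes
    data = raw * payload_bits / total_bits
    return raw, data


for name, lane, payload, total in GENERATIONS:
    raw, data = link_rates(lane, payload, total)
    print(f"{name}: {raw:.2f} Gb/s signalling, {data:.2f} Gb/s data on a 4x link")
```

Under these assumptions a 4x SDR link signals at 10 Gb/s and a 4x EDR link carries about 100 Gb/s of data, which lines up with the 10 to 100 gigabit range quoted earlier in the talk.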
  • 00:50:51
    okay so that was easier let's now go to
  • 00:50:54
    the addressing when we talk about the
  • 00:50:58
    addresses of protocols first we have the
  • 00:51:01
    physical address each one of us has a
  • 00:51:04
    physical address kind of genes I suppose
  • 00:51:08
    or whatever the physical address in the
  • 00:51:12
    environment of InfiniBand is called
  • 00:51:15
    GUID GUID is global unique
  • 00:51:20
    identifier of that specific node so all
  • 00:51:29
    the elements in the environment are
  • 00:51:31
    nodes each node will have its own GUID
  • 00:51:34
    who is a node a channel adapter of a
  • 00:51:38
    server is a node a switch is also a node
  • 00:51:45
    okay a gateway is also a node so each
  • 00:51:49
    one of them is having physical address
  • 00:51:53
    which is a bit like a MAC if you're not
  • 00:51:57
    familiar with GUID yet then you can
  • 00:51:59
    compare it to a MAC so this is my
  • 00:52:03
    physical address global unique identifier
  • 00:52:09
    the global unique identifier is an
  • 00:52:12
    address which is comprised of 64 bits
  • 00:52:16
    part of those bits of course will be the
  • 00:52:19
    vendor ID ok 64 bits
  • 00:52:24
    it is assigned by the InfiniBand vendor
  • 00:52:28
    and it is persistent through reboot so
  • 00:52:31
    if you're taking that channel adapter
  • 00:52:32
    with its port if the server goes down
  • 00:52:36
    and up the GUID is going to change yes
  • 00:52:43
    the GUID is going to change
  • 00:52:46
    no okay that was right so it is
  • 00:52:52
    persistent it is a fixed one the GUID
  • 00:52:55
    is a fixed parameter fixed identification
  • 00:52:58
    of the node now this is for channel
  • 00:53:04
    adapter so now here I would like to
  • 00:53:06
    emphasize that point every port in a
  • 00:53:11
    channel adapter has its own GUID do
  • 00:53:15
    some of you have more than one port on
  • 00:53:17
    the same server it's like two ports on
  • 00:53:21
    the same server yes
  • 00:53:23
    how many GUIDs do you have for two ports
  • 00:53:28
    one for each port right so this is
  • 00:53:31
    obvious Satna who is working with switch
  • 00:53:35
    with 36 ports your name oh sorry Chris
  • 00:53:47
    Chris how many GUIDs do you have to
  • 00:53:49
    that switch with 36 ports one GUID
  • 00:53:54
    right so in a switch you do not have a
  • 00:53:58
    specific GUID to every port in the switch
  • 00:54:02
    you have only one GUID or at least one
  • 00:54:05
    main GUID let's call it a node GUID of
  • 00:54:08
    the switch so the switch has the node
  • 00:54:12
    GUID so how will we identify a specific
  • 00:54:16
    port in that switch if you have thirty-six
  • 00:54:22
    ports how will you identify the port
  • 00:54:27
    by the port number exactly so each
  • 00:54:31
    port has a port number so if you like to
  • 00:54:35
    identify a specific port you actually
  • 00:54:37
    take the GUID of the switch plus the
  • 00:54:41
    port number right
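The addressing described so far (a 64-bit GUID, part of which identifies the vendor, and a switch port named by the switch GUID plus a port number) can be sketched as follows. The talk does not say how many bits belong to the vendor; the 24-bit prefix used here follows the usual EUI-64 convention and is an assumption, and the example GUID value and the `guid/pN` text format are made up for illustration.

```python
def split_guid(guid: int):
    """Split a 64-bit GUID into an assumed 24-bit vendor prefix and the remaining 40 bits."""
    if not 0 <= guid < 2**64:
        raise ValueError("a GUID is 64 bits wide")
    vendor = guid >> 40              # top 24 bits (assumed vendor part)
    device = guid & (2**40 - 1)      # low 40 bits (assigned by the vendor)
    return vendor, device


def switch_port_id(node_guid: int, port_number: int) -> str:
    """Name a switch port by the switch's node GUID plus the port number (hypothetical format)."""
    return f"{node_guid:016x}/p{port_number}"


example_guid = 0x0002C90300A1B2C3  # made-up value for illustration
vendor, device = split_guid(example_guid)
print(f"vendor prefix {vendor:06x}, device part {device:010x}")
print(switch_port_id(example_guid, 7))
```

A two-port channel adapter would simply carry two such GUIDs, one per port, while a 36-port switch needs only its node GUID and the numbers 1 to 36.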
  • 00:54:43
    there is another type of GUID which is
  • 00:54:46
    called the system GUID someone is
  • 00:54:49
    familiar with system GUID who has a
  • 00:54:55
    chassis a switch or modular switch which
  • 00:55:00
    is more than 36 ports no one here he'll
  • 00:55:04
    you have which type of very switch big
  • 00:55:10
    fabric one if I have a big fabric one it
  • 00:55:13
    means that on the same chassis on the
  • 00:55:15
    same physical chassis you have actually
  • 00:55:17
    multiple switches so each one of those
  • 00:55:21
    switches will have its own node GUID but
  • 00:55:25
    the chassis as a whole will have another
  • 00:55:28
    GUID which is called a system GUID so
  • 00:55:32
    the system GUID will be the identifier
  • 00:55:34
    of the whole chassis and each one of
  • 00:55:37
    those switches within the modular switch will
  • 00:55:40
    have a specific GUID okay IB fabric IB
  • 00:55:48
    fabric basic building block when
  • 00:55:52
    you have InfiniBand fabric InfiniBand
  • 00:55:55
    network that InfiniBand network is built
  • 00:55:58
    according to the needs what are what is
  • 00:56:01
    the basic need usually the basic need
  • 00:56:04
    will be the capacity how many nodes do
  • 00:56:07
    you need in your cluster right of course
  • 00:56:12
    this is only one parameter so when
  • 00:56:17
    we are talking about the fabric basic
  • 00:56:21
    building block each one of those block
  • 00:56:23
    is going to be a switch by the way each
  • 00:56:28
    one of the switches is based on an ASIC
  • 00:56:32
    what is the ASIC
  • 00:56:35
    actually the ASIC is the heart of your
  • 00:56:37
    switch the ASIC is a switch the ASIC is a chip
  • 00:56:42
    the ASIC is a switching chip and
  • 00:56:47
    today this generation every ASIC has 36
  • 00:56:52
    ports why do I say this generation
  • 00:56:55
    because if you're looking at two
  • 00:56:57
    thousand and four or five the ASIC the
  • 00:57:00
    basic ASIC was actually structured of 24
  • 00:57:04
    ports today the ASIC is 36 ports when you
  • 00:57:10
    build a network the network is built
  • 00:57:13
    according again to the number of nodes
  • 00:57:15
    that you have to provide that you have
  • 00:57:17
    to support it could be a network for
  • 00:57:21
    some of you you have I suppose 100
  • 00:57:24
    nodes some of you might have 600 nodes
  • 00:57:29
    some of our customers have 10,000 nodes
  • 00:57:35
    10,000 and more so of course the number
  • 00:57:39
    of switches that you will have to
  • 00:57:41
    provide will be a different number
  • 00:57:43
    according to the number of nodes that
  • 00:57:46
    you will have to connect okay here let's
  • 00:57:54
    look at that slide everybody can see the
  • 00:57:56
    slide even from the last lines I hope
  • 00:58:00
    okay so every switch here let's take a
  • 00:58:04
    basic switch in our cluster and as you
  • 00:58:10
    can see the nodes are connected the
  • 00:58:13
    nodes are connected to switches the
  • 00:58:17
    switches that are used for the direct
  • 00:58:20
    connection the direct connection to the
  • 00:58:23
    nodes those switches have a name that
  • 00:58:27
    somebody can tell us what is the name of
  • 00:58:28
    those switches
  • 00:58:33
    yes exactly those switches are called
  • 00:58:37
    the leafs
  • 00:58:38
    okay so the nodes the servers are
  • 00:58:43
    connected to the leaf switches each one
  • 00:58:46
    of those is a leaf now let's say that in
  • 00:58:50
    that example as you can see every leaf
  • 00:58:54
    is going to be connected with 18 ports
  • 00:58:58
    to 18 nodes so 18 ports of that leaf and
  • 00:59:03
    18 ports of that leaf and 18 ports of
  • 00:59:07
    that leaf and so on so if you are using
  • 00:59:11
    18 ports from the leaf to the nodes
  • 00:59:15
    what do you need the other 18 ports
  • 00:59:18
    for so I have thirty-six ports 18 are
  • 00:59:27
    used for the connection to the nodes
  • 00:59:30
    themselves what do I do with those 18
  • 00:59:35
    ports trunks okay the other 18 ports are
  • 00:59:41
    actually used as connections to the
  • 00:59:43
    other to the next layer the next layer
  • 00:59:47
    in the fabric is called how how do you
  • 00:59:52
    call those switches oh yes
  • 00:59:59
    oh it has another name
  • 01:00:04
    backbone also spine spine switches so we
  • 01:00:11
    have two basic type of switches we have
  • 01:00:15
    the leaf switches which are connected
  • 01:00:18
    directly to the server and we have the
  • 01:00:20
    spine switches and the spine switches
  • 01:00:24
    have one main job the spine switches are
  • 01:00:29
    actually will help us to make
  • 01:00:32
    interconnection between servers that do
  • 01:00:37
    not belong to the same to the same what
  • 01:00:44
    leaf to the same leaf because let's say
  • 01:00:47
    that you have those servers here servers
  • 01:00:51
    that belong to leaf a and here you have
  • 01:00:54
    servers that belong to leaf D how will
  • 01:00:57
    you connect between those servers to
  • 01:01:00
    those servers the answer is you need
  • 01:01:03
    another layer that layer will allow you
  • 01:01:06
    to make inter connection between
  • 01:01:11
    different leafs you would be able to
  • 01:01:15
    move packets between that leaf to that leaf
  • 01:01:20
    via that spine okay so this is the spine
  • 01:01:26
    layer this is the job of the spine layer
  • 01:01:29
    to allow connections between traffic
  • 01:01:31
    that goes or belongs to different leafs
  • 01:01:35
    now if you paid attention i took 18
  • 01:01:40
    ports to the servers and then I have 18
  • 01:01:44
    ports that are going to the spine
  • 01:01:46
    switches so actually 18 ports towards
  • 01:01:50
    the servers and 18 ports towards the
  • 01:01:53
    spines for interconnection why did I
  • 01:02:00
    take 18 ports here and 18 ports for
  • 01:02:03
    interconnection
  • 01:02:08
    yes at the same time right so actually
  • 01:02:17
    one of the things you would like as HPC
  • 01:02:20
    customers you would like the network to
  • 01:02:23
    be a non-blocking network and one of the
  • 01:02:32
    ways to get close to that target is that
  • 01:02:36
    if for example you have 18 nodes or 18
  • 01:02:41
    servers in order to verify that they
  • 01:02:46
    will get the bandwidth that they need
  • 01:02:49
    when they want to send out information
  • 01:02:51
    to other 18 servers you would like
  • 01:02:56
    the bandwidth of the nodes
  • 01:03:01
    to be supported by equal bandwidth in
  • 01:03:05
    the interconnection fabric so if you
  • 01:03:10
    would like to move 100 gigabit per
  • 01:03:13
    second from the nodes you need 100
  • 01:03:17
    gigabit per second in the
  • 01:03:19
    interconnection environment to support
  • 01:03:21
    it so actually if you have a number of
  • 01:03:28
    18 ports of FDR to the nodes you will
  • 01:03:34
    need 18 ports of FDR in the
  • 01:03:38
    interconnection if you have 72 ports of
  • 01:03:42
    FDR to the nodes you will need 72 ports
  • 01:03:48
    of FDR in the interconnection in that
  • 01:03:52
    way you will improve the chances that
  • 01:03:55
    every time there is bandwidth that
  • 01:03:59
    is required for a node to go out
  • 01:04:02
    to another server you will have that
  • 01:04:06
    bandwidth in the interconnection
  • 01:04:08
    environment this is called non-blocking
  • 01:04:11
    or if you would like virtual or semi non
  • 01:04:14
    blocking
  • 01:04:15
    in the interconnection environment now
  • 01:04:20
    of course it's not the total proof it's
  • 01:04:25
    not 100% but this is one of the rules
  • 01:04:30
    that we use in order to have kind of
  • 01:04:34
    semi non-blocking environment of
  • 01:04:38
    course we can we could have taken
  • 01:04:41
    instead of 18 ports here and 18 ports in the
  • 01:04:44
    interconnection environment we could
  • 01:04:46
    have taken for example 27 here and 9
  • 01:04:50
    here ok what happens if you have 27
  • 01:04:54
    ports here and 9 ports that are going to
  • 01:04:58
    the interconnection you have actually a
  • 01:05:01
    different ratio between the
  • 01:05:06
    interconnection ports to the connections
  • 01:05:09
    that are going to the servers and the
  • 01:05:12
    different ratio means that you have
  • 01:05:14
    a higher number of node ports than ports
  • 01:05:18
    of interconnection you will get a
  • 01:05:21
    blocking ratio okay let's go to the next
  • 01:05:29
    one if you have questions please ask me
  • 01:05:31
    yes please why do the InfiniBand switches have
  • 01:05:36
    36 ports for example not a power of 2 like 32 why
  • 01:05:41
    the switches do not have a different
  • 01:05:43
    number why this choice in this case what
  • 01:05:47
    [inaudible] 36 ports okay the
  • 01:05:50
    reason is the reason is that the
  • 01:05:54
    integrated circuits which are produced
  • 01:05:58
    the ASICs
  • 01:05:59
    which are produced the basic ASIC today
  • 01:06:03
    has 36 ports although I can tell you
  • 01:06:07
    that we also have a switch of 12 ports
  • 01:06:10
    for example which is also a possibility
  • 01:06:14
    which is clear because of the hardware
  • 01:06:16
    but exactly
  • 01:06:17
    why the ASIC was designed with 36
  • 01:06:20
    I'm not sure I can tell you I can tell
  • 01:06:29
    you that the previous generation was 24
  • 01:06:37
    you're talking about the flip-flops
  • 01:06:40
    inside and so on I suppose anyway this
  • 01:06:43
    is the structure this is the internal
  • 01:06:45
    structure okay so in that example we
  • 01:06:48
    have actually a fabric that will support
  • 01:06:53
    72 ports or 72 nodes
  • 01:06:57
    why 72 nodes because we have created we
  • 01:07:02
    have created an environment that has
  • 01:07:04
    four leafs each one of the leafs has 36
  • 01:07:10
    ports and among those 36 ports we have
  • 01:07:14
    18 ports to the nodes so 18 18 18 18 that is 72
  • 01:07:20
    ports towards the nodes and then we have
  • 01:07:23
    72 ports which are providing us the
  • 01:07:27
    support in the interconnection
  • 01:07:28
    environment question
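The port arithmetic in this part of the talk can be checked with a short sketch. The helper name `two_tier_fabric` and its result fields are hypothetical (they are not from the talk), and the sketch assumes that every leaf and spine is built from the same ASIC and that each leaf splits its ports between nodes and spine uplinks.

```python
from fractions import Fraction


def two_tier_fabric(ports_per_switch, node_ports_per_leaf, leafs):
    """Size a two-tier leaf/spine fabric (hypothetical helper, not from the talk).

    ports_per_switch    -- ports on each switch ASIC (36 in the talk)
    node_ports_per_leaf -- leaf ports that face the servers
    leafs               -- number of leaf switches
    """
    uplinks_per_leaf = ports_per_switch - node_ports_per_leaf
    if node_ports_per_leaf <= 0 or uplinks_per_leaf <= 0:
        raise ValueError("each leaf needs both node ports and uplink ports")

    total_uplinks = leafs * uplinks_per_leaf
    # Ceiling division: spine switches needed to terminate every uplink.
    spines = -(-total_uplinks // ports_per_switch)

    return {
        "nodes": leafs * node_ports_per_leaf,
        "uplinks": total_uplinks,
        "spines": spines,
        # 1 means equal bandwidth down and up; above 1 is a blocking ratio.
        "blocking_ratio": Fraction(node_ports_per_leaf, uplinks_per_leaf),
    }


# The example from the talk: 4 leafs, 18 ports down and 18 ports up.
balanced = two_tier_fabric(36, 18, 4)
print(balanced["nodes"], balanced["spines"], balanced["blocking_ratio"])

# The 27/9 split mentioned earlier: more nodes, but a 3:1 blocking ratio.
blocking = two_tier_fabric(36, 27, 4)
print(blocking["nodes"], blocking["spines"], blocking["blocking_ratio"])
```

With the 18/18 split the sketch gives 72 nodes, 72 uplinks and 2 spine chips, which matches the numbers in the talk. The 27/9 split supports 108 nodes on the same four leafs, at the cost of three node ports competing for each uplink.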
  • 01:07:36
    so we have four chips which function
  • 01:07:39
    as lines or leafs and two chips which
  • 01:07:43
    will function as cores or spines okay so
  • 01:07:51
    that was the main structure of the
  • 01:07:55
    fabric let's go to the next stage we
  • 01:07:57
    talked about the physical address that
  • 01:08:00
    we call the GUID let's go to the next
  • 01:08:02
    stage when we are moving traffic between
  • 01:08:05
    one node to another node we need to have
  • 01:08:08
    an address the address which is used is
  • 01:08:12
    the layer 2 address so when I would
  • 01:08:14
    like to talk with Chris or when I
  • 01:08:17
    would like to talk with Gabriel each one
  • 01:08:21
    of them will have a layer 2 address the
  • 01:08:24
    layer 2 address in the InfiniBand
  • 01:08:26
    environment is called LID local
  • 01:08:29
    identifier this is the LID the local
  • 01:08:33
    identifier it has 16 bits 16 bits it
  • 01:08:38
    means that the range that is supported
  • 01:08:41
    could be up to 2 to the power of 16 which is about
  • 01:08:45
    how many addresses about 65,000 right
  • 01:08:51
    yes this is the layer 2 address range
  • 01:08:54
    who is providing the addresses well one
  • 01:08:59
    function this is the subnet manager each
  • 01:09:02
    one of the nodes in this room is going
  • 01:09:05
    to get lead provided by the subnet
  • 01:09:08
    manager the lid again is called local
  • 01:09:14
    identifier so each one of your servers
  • 01:09:17
    has one or two leads it depends of the
  • 01:09:23
    number of the ports that you are
  • 01:09:24
    connected with it is assigned by the
  • 01:09:27
    subnet manager and it is theoretically
  • 01:09:31
    not persistent during reboot
  • 01:09:37
    theoretically why do I say theoretically
  • 01:09:40
    if I will give you the option what would
  • 01:09:43
    you like to what do you prefer would you
  • 01:09:45
    like would you like it to be changed
  • 01:09:47
    when you reset the server what who said no no
  • 01:09:57
    yeah you wouldn't like you would like
  • 01:10:00
    you will prefer your network you will
  • 01:10:04
    prefer your fabric to stay as
  • 01:10:06
    stable as possible you would like that
  • 01:10:09
    if Chris would like to call Gabriel
  • 01:10:12
    even if Gabriel went down because of
  • 01:10:16
    some issue with well even in this case
  • 01:10:24
    when it comes up we would like him to
  • 01:10:27
    stay with the same LID so of course
  • 01:10:29
    there are parameters that will allow you
  • 01:10:31
    to say that even if a server goes down we
  • 01:10:35
    would prefer that when it comes up we
  • 01:10:38
    would prefer him to come with the same
  • 01:10:41
    layer 2 address with the same LID when
  • 01:10:45
    we talk about the LID addresses we have
  • 01:10:48
    two ranges of addresses because we have
  • 01:10:50
    two types of communication in the
  • 01:10:53
    InfiniBand environment the first type of
  • 01:10:56
    communication in the InfiniBand environment
  • 01:10:58
    is the unicast communication when I talk
  • 01:11:02
    with Gabriel only the two of us this
  • 01:11:05
    is unicast when we talk here in the room
  • 01:11:08
    and everybody hears this is what when I
  • 01:11:15
    send information to a group of listeners
  • 01:11:19
    it is is it unicast what is it it is
  • 01:11:25
    multicast of course somebody will ask
  • 01:11:28
    yeah but it's like a broadcast actually
  • 01:11:30
    but in InfiniBand we do not have
  • 01:11:33
    broadcast in InfiniBand we have only
  • 01:11:37
    multicast and to simulate it if you
  • 01:11:41
    would like to do like a broadcast you
  • 01:11:43
    will actually create a multicast group
  • 01:11:45
    that will include all the participants
  • 01:11:49
    so in InfiniBand there is no broadcast only
  • 01:11:54
    multicast so if you talk about the
  • 01:11:57
    specific addresses you have a unicast
  • 01:12:00
    address range which
  • 01:12:01
    is between 0x0001 and 0xBFFF and you have a
  • 01:12:04
    multicast range which is between 0xC000 and
  • 01:12:08
    0xFFFE these are the LID addresses who
  • 01:12:14
    provides the LIDs the subnet manager okay
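The two LID ranges can be written down as a small sketch. The function name `classify_lid` is hypothetical and only illustrates the address split. The talk does not cover the two values that fall outside both ranges (0x0000 and 0xFFFF), so the sketch simply labels them "other".

```python
LID_BITS = 16  # the LID is a 16-bit layer 2 address

# Ranges as given in the talk.
UNICAST_MIN, UNICAST_MAX = 0x0001, 0xBFFF
MULTICAST_MIN, MULTICAST_MAX = 0xC000, 0xFFFE


def classify_lid(lid):
    """Return 'unicast', 'multicast' or 'other' for a 16-bit LID value."""
    if not 0 <= lid < 2 ** LID_BITS:
        raise ValueError("a LID must fit in 16 bits")
    if UNICAST_MIN <= lid <= UNICAST_MAX:
        return "unicast"
    if MULTICAST_MIN <= lid <= MULTICAST_MAX:
        return "multicast"
    # 0x0000 and 0xFFFF are outside both ranges and are not covered in the talk.
    return "other"


total = 2 ** LID_BITS                                  # 65536, the "about 65,000"
unicast_count = UNICAST_MAX - UNICAST_MIN + 1          # 49151
multicast_count = MULTICAST_MAX - MULTICAST_MIN + 1    # 16383

print(total, unicast_count, multicast_count)
print(classify_lid(0x0012), classify_lid(0xC001))
```

Roughly three quarters of the 16-bit space is therefore available for unicast LIDs, which is what the subnet manager assigns to the ports of servers and switches.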
  • 01:12:22
    the next element that I wanted to show
  • 01:12:25
    you is that so as I was saying we have
  • 01:12:29
    only one subnet in your organization
  • 01:12:33
    most of you one of the questions will be
  • 01:12:37
    and it might come from you so what if we
  • 01:12:41
    would like if we would like to have
  • 01:12:44
    different departments or different
  • 01:12:47
    customers in our organization in our
  • 01:12:51
    network how can we do that in InfiniBand
  • 01:12:55
    we know how to do it in ethernet how do
  • 01:12:58
    we allow different departments to have
  • 01:13:02
    the different kind of environment in
  • 01:13:04
    InfiniBand we have only one subnet can
  • 01:13:10
    I ask if some of you in your environment
  • 01:13:12
    have such a need or thought of such a
  • 01:13:17
    need who knows what the solution is well
  • 01:13:22
    since there is no answer I will
  • 01:13:25
    answer myself I will ask and answer so
  • 01:13:29
    the answer will be partitioning
  • 01:13:34
    partitioning is the way that you can
  • 01:13:37
    actually implement in your environment
  • 01:13:39
    on your InfiniBand fabric in order to
  • 01:13:43
    define different partitions for
  • 01:13:47
    different customers you may have the red
  • 01:13:51
    partition and the green partition and
  • 01:13:53
    the blue partition which all belong to
  • 01:13:56
    the same subnet but their packets have
  • 01:14:01
    a different color actually it's called
  • 01:14:04
    different identifiers and that specific
  • 01:14:08
    identifier is called partition
  • 01:14:13
    partition key okay is actually the
  • 01:14:18
    identifier of every different partition
  • 01:14:22
    so now you can say I have the storage
  • 01:14:26
    guys I have the Open MPI guys I have
  • 01:14:31
    other type of guys IP over IB guys and I
  • 01:14:36
    would like to have them on a
  • 01:14:39
    separated environment although all of us
  • 01:14:43
    belong to the same subnet that is called
  • 01:14:48
    partitioning so that partitioning
  • 01:14:55
    can help us to provide a different
  • 01:14:58
    environment to different applications or
  • 01:15:03
    a different environment because of
  • 01:15:05
    security purposes because for example in
  • 01:15:08
    your institute you have two research projects
  • 01:15:12
    one of them is for a university and the
  • 01:15:17
    other one is for Boeing or Airbus and
  • 01:15:22
    the research for Airbus the Airbus guys
  • 01:15:25
    who are paying money for that they are
  • 01:15:27
    telling you we wouldn't like our
  • 01:15:30
    part of the network to have its packets
  • 01:15:34
    in touch with the other common people's
  • 01:15:37
    packets can you separate them the answer
  • 01:15:41
    is yes partitioning
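The isolation idea can be illustrated with a deliberately simplified sketch. The `Port` class, the `can_communicate` helper and the key values are hypothetical and chosen only for illustration. Real InfiniBand partition keys also carry a membership type (full or limited), which this sketch ignores.

```python
from dataclasses import dataclass, field


@dataclass
class Port:
    """A fabric port with the partition keys it has been given (simplified)."""
    name: str
    pkeys: set = field(default_factory=set)


def can_communicate(a, b):
    """Two ports can exchange packets only if they share a partition key."""
    return bool(a.pkeys & b.pkeys)


# Illustrative key values; all ports still belong to the same subnet.
RED, GREEN = 2, 3

university_node = Port("university-node", {RED})
airbus_node = Port("airbus-node", {GREEN})
storage_server = Port("storage-server", {RED, GREEN})  # member of both partitions

print(can_communicate(university_node, airbus_node))   # False: no shared key
print(can_communicate(university_node, storage_server))  # True: both hold RED
print(can_communicate(airbus_node, storage_server))    # True: both hold GREEN
```

The point of the sketch is that separation comes from the key carried by the traffic, not from building a second subnet: a shared resource can be placed in several partitions while the two customers never see each other's packets.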
  • 01:15:46
    the packets of the red environment will
  • 01:15:49
    have partition key ID 2 and
  • 01:15:53
    the green environment will have a
  • 01:15:55
    partition key ID 3 this is not the only thing
  • 01:16:01
    the other thing that you can have with
  • 01:16:04
    partitioning is quality of service which
  • 01:16:07
    could be different could be different
  • 01:16:11
    it's not the only tool but it is one of the
  • 01:16:14
    tools that will allow you to provide
  • 01:16:16
    different quality of service to Open MPI
  • 01:16:20
    and another quality of service to GPFS why
  • 01:16:26
    because I might put them I might place
  • 01:16:30
    them in a different environment or
  • 01:16:32
    a different partition yeah okay let's stop
  • 01:16:40
    there let's stop here we'll have a
  • 01:16:43
    coffee break and we will continue in
  • 01:16:47
    15 minutes
Tags
  • InfiniBand
  • Mellanox
  • HPC
  • RDMA
  • Subnet Manager
  • training
  • Leaf and Spine
  • network performance
  • data rate