InfiniBand Principles Every HPC Expert MUST Know (Part 1)
Summary
TLDR: Oded Paz, from Mellanox, introduces the basics of InfiniBand technology and Mellanox Academy in this presentation. The focus is on educating HPC experts on the foundational elements of InfiniBand, including its architecture and components such as leaf and spine switches, and the role of the Subnet Manager in network management. InfiniBand is highlighted for its high-speed, low-latency data movement, which makes it well suited to high-performance computing environments. The session also covers Mellanox Academy's offerings, which include instructor-led sessions, online courses, and certifications covering Mellanox products and technologies. Key technologies discussed include remote direct memory access (RDMA) for reduced latency and CPU usage, as well as the upcoming Enhanced Data Rate (EDR) generation of InfiniBand links at 100 Gbps.
Takeaways
- 👋 Oded Paz from Mellanox introduces himself and the session.
- 📚 Mellanox Academy provides diverse training formats.
- 🚀 InfiniBand excels in high-speed, low-latency data transport.
- 🔗 Current links are 56 Gbps, aiming for 100 Gbps soon.
- 🔧 Subnet Manager automates network settings.
- 🖥️ Leaf and Spine are key switch types in the architecture.
- 🏎️ RDMA boosts performance by bypassing the kernel.
- 🌌 InfiniBand networks are designed for scalability.
- 🔑 GUIDs are unique identifiers in InfiniBand.
- 🛡️ Partitioning offers security and QoS in networks.
Timeline
- 00:00:00 - 00:05:00
The speaker introduces himself as a training professional from Mellanox, focusing on InfiniBand technology relevant to HPC experts. He highlights the offerings of Mellanox Academy, including various training services and resources.
- 00:05:00 - 00:10:00
The session aims to cover InfiniBand essentials for HPC experts, focusing on fabric architecture, network components, and Mellanox technologies. The speaker encourages questions and emphasizes practical application of concepts.
- 00:10:00 - 00:15:00
InfiniBand provides a high-performance solution for HPC environments, supporting speeds up to 56 Gbps and expected increases to 100 Gbps. Mellanox offers various generations of technology enhancing data rate and latency.
- 00:15:00 - 00:20:00
Mellanox provides host channel adapters and switches integral to InfiniBand technology, supporting flexible scalability and superior performance in HPC environments.
- 00:20:00 - 00:25:00
The speaker explains the InfiniBand generation history, detailing the evolution from 10 Gbps SDR to 56 Gbps FDR, and towards 100 Gbps EDR. Critical advancements include reduced latency and improved efficiency via RDMA.
- 00:25:00 - 00:30:00
InfiniBand was designed for high performance and reliability in network communications, offering low latency and high bandwidth. It's managed via software-defined networking, specifically by a subnet manager.
- 00:30:00 - 00:35:00
The subnet manager, a key entity in InfiniBand networks, facilitates configuration and redundancy. InfiniBand's use of RDMA minimizes CPU load and enhances data transfer efficiency, crucial for HPC tasks.
- 00:35:00 - 00:40:00
InfiniBand architecture features include host channel adapters, switches for subnet connectivity, and gateways for cross-protocol communication. It's primarily a layer 2 protocol facilitating efficient node communication.
- 00:40:00 - 00:45:00
Certain environments may require gateways to bridge InfiniBand with Ethernet or Fibre Channel, allowing cross-protocol data exchange typically managed through IP over IB standards.
- 00:45:00 - 00:50:00
Subnet managers automate configuration in InfiniBand networks as opposed to manual setups typical in Ethernet. Redundancy ensures network reliability through active and standby subnet managers.
- 00:50:00 - 00:55:00
Layer 2 addressing in InfiniBand is handled through local identifiers, managed by the subnet manager. These are subject to change on reboot unless configured otherwise for stability.
- 00:55:00 - 01:00:00
Fabric architecture in InfiniBand involves hierarchical topology with leaf and spine layers. Leaf switches connect directly to nodes, while spine switches enable cross-leaf communication for scalability.
- 01:00:00 - 01:05:00
InfiniBand supports semi-non-blocking environments through strategic port allocation, aiming to balance server-to-server bandwidth across the network fabric.
- 01:05:00 - 01:10:00
Partitioning in InfiniBand allows virtual segmentation within a single subnet, facilitating distinct operational environments or security boundaries for different applications or clients.
- 01:10:00 - 01:16:53
The speaker concludes the session, highlighting InfiniBand's ability to support varied network demands through adaptable architecture and partitioning. The session pauses for a break.
Video Q&A
What is Mellanox Academy?
Mellanox Academy is an organization that provides training services, including instructor-led, web-based, and e-learning, to help individuals gain knowledge in InfiniBand and other Mellanox technologies.
What is InfiniBand used for?
InfiniBand is used for high-performance computing environments to move data traffic with high speed, low latency, and increased bandwidth.
What speeds do current InfiniBand links support?
Current InfiniBand links support speeds up to 56 gigabits per second, with plans to reach 100 gigabits per second.
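The session points out that the quoted link speed is a signaling rate: QDR uses 8b/10b encoding, which spends 20% of the line rate on encoding overhead, so a 40 Gbps QDR port carries 32 Gbps of data. A minimal sketch of that arithmetic follows. The function name and table are illustrative; applying 8b/10b to SDR and DDR and 64b/66b to FDR is an assumption based on general knowledge of the standard, not something stated in this part of the talk.

```python
def effective_gbps(signal_gbps: float, payload_bits: int, coded_bits: int) -> float:
    """Usable data rate once line-encoding overhead is removed."""
    return signal_gbps * payload_bits / coded_bits


# (nominal signaling rate in Gbps, payload bits, coded bits) per generation.
GENERATIONS = {
    "SDR": (10, 8, 10),   # assumption: 8b/10b, as described for QDR
    "DDR": (20, 8, 10),   # assumption: 8b/10b, as described for QDR
    "QDR": (40, 8, 10),   # from the session: 20% encoding overhead
    "FDR": (56, 64, 66),  # assumption: 64b/66b encoding
}

for name, (rate, payload, coded) in GENERATIONS.items():
    data_rate = effective_gbps(rate, payload, coded)
    print(f"{name}: {rate} Gbps signal -> {data_rate:.1f} Gbps data")
```

The QDR row reproduces the "40 versus 32" exchange with the audience in the session.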
What is RDMA?
RDMA (Remote Direct Memory Access) is a technology that allows data to move directly between computers, bypassing the kernel and reducing latency and CPU load.
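The copy-counting argument behind RDMA can be illustrated with a toy model. This is not RDMA code and uses no RDMA library: the two function names are hypothetical, and a `memoryview` merely stands in for the adapter reading the user buffer directly. In the conventional path the payload is copied from the user buffer to a kernel buffer and again to the adapter; in the RDMA-like path no copy is made.

```python
def send_via_kernel(user_buf: bytearray) -> tuple[bytearray, int]:
    """Conventional path: user buffer -> kernel buffer -> adapter buffer."""
    copies = 0
    kernel_buf = bytearray(user_buf)     # copy 1: user space to kernel space
    copies += 1
    adapter_buf = bytearray(kernel_buf)  # copy 2: kernel space to the adapter
    copies += 1
    return adapter_buf, copies


def send_rdma_like(user_buf: bytearray) -> tuple[memoryview, int]:
    """Zero-copy path: the 'adapter' reads the user buffer in place."""
    return memoryview(user_buf), 0


payload = bytearray(b"hpc")
staged, kernel_copies = send_via_kernel(payload)
view, rdma_copies = send_rdma_like(payload)
print(kernel_copies, rdma_copies)
```

Because the view shares memory with the application buffer, the CPU does no per-packet copying in the second path, which is the effect the session attributes to kernel bypass.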
What is the main role of the Subnet Manager in InfiniBand?
The Subnet Manager is responsible for managing the network configurations of InfiniBand, including assigning layer 2 addresses (LIDs) and ensuring network stability.
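As a rough illustration of what "assigning layer 2 addresses" means, the toy sketch below hands out LIDs to ports identified by their GUIDs. It is not how any real subnet manager is implemented: the sequential policy, the function name, and the sample GUIDs are all hypothetical, and the unicast range bounds are an assumption taken from general knowledge of the InfiniBand specification.

```python
UNICAST_LID_MIN = 0x0001  # assumption: first usable unicast LID
UNICAST_LID_MAX = 0xBFFF  # assumption: last usable unicast LID


def assign_lids(port_guids: list[int], first_lid: int = UNICAST_LID_MIN) -> dict[int, int]:
    """Toy policy: give each discovered port GUID the next free LID."""
    lids: dict[int, int] = {}
    next_lid = first_lid
    for guid in port_guids:
        if guid in lids:
            continue  # a port seen twice keeps the LID it already has
        if next_lid > UNICAST_LID_MAX:
            raise ValueError("unicast LID space exhausted")
        lids[guid] = next_lid
        next_lid += 1
    return lids


# Hypothetical port GUIDs discovered during a fabric sweep.
discovered = [0x0002C90300A1B2C3, 0x0002C90300A1B2C4]
for guid, lid in assign_lids(discovered).items():
    print(f"GUID 0x{guid:016x} -> LID {lid}")
```

The sketch also shows why a LID is not a stable identity: it depends on what the subnet manager decides during discovery, whereas the GUID stays with the device.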
What is the difference between a Leaf switch and a Spine switch?
A Leaf switch connects directly to servers, while a Spine switch connects multiple Leaf switches, facilitating inter-server communication.
What does non-blocking mean in the context of network switches?
Non-blocking means the interconnection fabric provides as much bandwidth between switches as the attached servers can generate, so server-to-server traffic is not held back by oversubscribed uplinks.
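One way to make the non-blocking idea concrete is to compare a leaf switch's server-facing ports with its spine-facing ports. The sketch below is illustrative only: the function names are hypothetical, all links are assumed to run at the same speed, and the 36-port switch size is an assumption rather than a figure from the session.

```python
from fractions import Fraction


def blocking_ratio(downlinks: int, uplinks: int) -> Fraction:
    """Server-facing ports divided by spine-facing ports on one leaf (1 = non-blocking)."""
    if downlinks < 0 or uplinks <= 0:
        raise ValueError("need at least one uplink and a non-negative downlink count")
    return Fraction(downlinks, uplinks)


def max_hosts_two_tier(ports_per_switch: int) -> int:
    """Hosts in a non-blocking two-tier leaf/spine fabric built from identical switches."""
    half = ports_per_switch // 2     # half of each leaf's ports go down, half go up
    return half * ports_per_switch   # each spine port can reach one distinct leaf


print(blocking_ratio(18, 18))  # 1 -> non-blocking
print(blocking_ratio(24, 12))  # 2 -> 2:1 oversubscribed
print(max_hosts_two_tier(36))  # 648
```

With the assumed 36-port building block the non-blocking limit comes out at 648 hosts, which happens to match the largest switch port count mentioned in the session.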
How is the physical address identified in InfiniBand networks?
In InfiniBand networks, the physical address is the GUID (Globally Unique Identifier), a 64-bit address assigned to every device.
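Since a GUID is just a 64-bit number, it is usually handled as 16 hexadecimal digits. The helper below is a small sketch for formatting and parsing such values; the function names, the colon-grouped layout, and the sample value are illustrative assumptions, not output from any particular tool.

```python
def format_guid(guid: int) -> str:
    """Render a 64-bit GUID as four colon-separated groups of four hex digits."""
    if not 0 <= guid < 2**64:
        raise ValueError("a GUID must fit in 64 bits")
    digits = f"{guid:016x}"
    return ":".join(digits[i:i + 4] for i in range(0, 16, 4))


def parse_guid(text: str) -> int:
    """Inverse of format_guid; also accepts plain hex without colons."""
    value = int(text.replace(":", ""), 16)
    if value >= 2**64:
        raise ValueError("a GUID must fit in 64 bits")
    return value


sample = 0x0002C90300A1B2C3  # hypothetical GUID used only for illustration
print(format_guid(sample))
```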
What is a Gateway in InfiniBand networks?
A Gateway is a device used to convert and transfer data packets between different network protocols such as Ethernet and InfiniBand.
How do InfiniBand networks achieve software-defined networking (SDN)?
InfiniBand networks achieve SDN through the Subnet Manager, which automates and manages all network configurations and parameters.
- 00:00:03okay I'll begin first with the
- 00:00:06introduction again my name is Oded Paz
- 00:00:09I'm from Mellanox I'm working in
- 00:00:12Mellanox for three and a half years
- 00:00:14about and doing training I'm part of a
- 00:00:18support division which means we are
- 00:00:21involved in all the issues if you like
- 00:00:23and technologies that Mellanox is
- 00:00:26dealing with some of you might have met
- 00:00:30me maybe last year in Lugano as well
- 00:00:33last year and my main intention today is
- 00:00:40actually to provide the HPC experts
- 00:00:44because you are actually HPC experts to
- 00:00:48the guys among you which are not
- 00:00:51familiar with the basics of InfiniBand
- 00:00:56and the most important issues in
- 00:01:00InfiniBand skeleton if you like or
- 00:01:03fabric I would like to make it clear to
- 00:01:07most of us so this is my main intention
- 00:01:11today I have additional small target and
- 00:01:16the small target is to provide you with
- 00:01:19information about Mellanox Academy
- 00:01:24Mellanox academy actually is a
- 00:01:28organization that can help you get
- 00:01:32different type of information we still
- 00:01:35have problems with that yeah okay so we
- 00:01:44would like to allow you to offer you
- 00:01:47training services that could be online
- 00:01:50service services that could be a
- 00:01:54instructor-led training could be also
- 00:01:58web training or okay that's better I
- 00:02:00suppose and this is part of Mellanox
- 00:02:03Academy so a lot of subjects that some
- 00:02:09of you have holes of knowledge in them
- 00:02:13are covered as part of Mellanox Academy
- 00:02:16and as you can see as I was saying we
- 00:02:20are providing in a instructor-led
- 00:02:23training which is me and some other
- 00:02:26people we have online training and we
- 00:02:30also have certification processes in
- 00:02:33order to allow people of your
- 00:02:35organization to verify that they have
- 00:02:40the right knowledge to provide support
- 00:02:42or even just to to verify that they have
- 00:02:46the basic of the InfiniBand technology
- 00:02:49in order to get most of that so this is
- 00:02:53actually the Mellanox Academy if you
- 00:02:56will enter to our Mellanox Academy you
- 00:02:58will see that we have the white papers
- 00:03:01you have videos you have simulators and
- 00:03:04you have other tools that can help you
- 00:03:08get your theoretical knowledge as well as
- 00:03:10practical knowledge if you are going
- 00:03:13into configuration or installation
- 00:03:17sessions so our sessions do not force
- 00:03:21you to take again instruct
- 00:03:25instructor-led training you can also do
- 00:03:28a lot of jobs by taking the e-learning
- 00:03:31sessions I'm not going to go into the
- 00:03:37specific solutions that they are
- 00:03:40provided actually I told you we what we
- 00:03:44are providing and as I was saying we
- 00:03:49have different types of learning methods
- 00:03:54instructor web eLearning and those are
- 00:04:00the main ones we have a list here of the
- 00:04:03offering list of the training that you
- 00:04:05can have so of course you don't have to
- 00:04:08take it now but you need to be aware
- 00:04:11that you have that option and as we were
- 00:04:16saying we have flexible solution a
- 00:04:19unique solution for all knowledge levels
- 00:04:22and professional roles
- 00:04:23in the infinite built environment and
- 00:04:25InfiniBand and HPC as well
- 00:04:29we have varied content of sessions and
- 00:04:32courses then we have also flexible
- 00:04:35payment options but between you and me
- 00:04:38money is not a problem so suppose you
- 00:04:42don't care about that of course we have
- 00:04:45certified remote lab instructors and
- 00:04:49courseware that you can use for that
- 00:04:52same atom so we really urge you to use
- 00:04:58Mellanox academy not for maybe your own
- 00:05:01personal use but your environment and
- 00:05:04your partners and this is about Mellanox
- 00:05:09academy this is the site by the way of
- 00:05:11the Mellanox academy maybe try to use it
- 00:05:15later and see what are the options and
- 00:05:18that was my introduction regarding
- 00:05:21Mellanox academy and let's start with
- 00:05:26the main target that we would like to
- 00:05:30cover today so the session is called
- 00:05:34InfiniBand essentials every
- 00:05:37HPC expert must know so you are already
- 00:05:40HPC experts so I suppose you should know
- 00:05:45the following subjects some of you are
- 00:05:49familiar with all of them some of you
- 00:05:51are familiar with nothing some of you
- 00:05:53are familiar with part of that let's try
- 00:05:56to cover together again what we are
- 00:05:59going to look at the IB principles we
- 00:06:02are going to talk about the fabric
- 00:06:04components the network components and
- 00:06:06the fabric architecture we are going to
- 00:06:10talk about the network or fabric
- 00:06:14discovery mainly the jobs of the subnet
- 00:06:19manager which is in charge of basically
- 00:06:22everything in the network we are going
- 00:06:26to talk about the main things the main
- 00:06:30things that happens and the main
- 00:06:32functions of each one of the InfiniBand
- 00:06:37protocol layers and we'll talk a bit
- 00:06:41about Mellanox products these are all
- 00:06:44the things that I'm going to cover today
- 00:06:46of course if you will have specific
- 00:06:49questions or questions that will arise
- 00:06:52during my sessions you're free to ask I
- 00:06:56hope I would be able to provide answers
- 00:06:59to all questions and in case I won't be
- 00:07:04able I promise to provide you the right
- 00:07:07answers by mail again my name is Oded
- 00:07:12Oded Paz from Mellanox I'm not
- 00:07:15going to ask you for all your names now
- 00:07:18it's beyond 20 people so I'm not going
- 00:07:20to do that but if you ask questions I
- 00:07:24will be happy if you are able to present
- 00:07:27yourself where you are coming from what
- 00:07:30is your job in the organization which
- 00:07:32organization etc thank you very much and
- 00:07:35we will begin something to ask before I
- 00:07:38begin no okay
- 00:07:44ah so when we talk about the InfiniBand
- 00:07:49a InfiniBand environment
- 00:07:52InfiniBand fabric InfiniBand network are
- 00:07:57going to provide and provide the best
- 00:08:00solution today especially to move
- 00:08:05traffic to move data in HPC environment
- 00:08:10and the reasons are because it does
- 00:08:13provide three or four features that are
- 00:08:18very very important for the users for
- 00:08:21the customers for the clients for the
- 00:08:24applications of the HPC the
- 00:08:27high-performance computing environment
- 00:08:30if you look at the slide that we see
- 00:08:34here then a network usually a fabric
- 00:08:38will have its clients the clients will
- 00:08:42be a servers that are compute servers or
- 00:08:46storage servers that have to move
- 00:08:48traffic between them
- 00:08:50in order to provide the switching
- 00:08:53between them while providing the
- 00:08:55InfiniBand switches which are working
- 00:08:57with the InfiniBand protocol the
- 00:09:01connections between the compute nodes to
- 00:09:05the switches or the storage nodes to the
- 00:09:07switches are used with InfiniBand
- 00:09:11links those InfiniBand links today are
- 00:09:15going at a pace at a speed of 56 gigabit
- 00:09:20per second which is called FDR we're going
- 00:09:24to talk about it later
- 00:09:25and later this year the maximum rate per
- 00:09:29port will reach 100 gigabit per second
- 00:09:33in a speed rate that is called or the standard
- 00:09:37of that specific part is called
- 00:09:40EDR enhanced data rate so currently we
- 00:09:44are in the generation that allows us
- 00:09:45per port of the server and per port of the
- 00:09:49switch of up to 56 gigabit per second if
- 00:09:55we talk specifically about Mellanox
- 00:09:58technology Mellanox technology
- 00:10:00generation today is providing the
- 00:10:04channel adapter on the server side of
- 00:10:07the network adapter what is called the
- 00:10:09HCA host channel adapters in a
- 00:10:12generation which is called ConnectX-3
- 00:10:14and Connect-IB I'm going to talk about
- 00:10:17the differences between them and the
- 00:10:20switches generation that do provide the
- 00:10:23FDR capability are called SwitchX-2
- 00:10:27when you talk about the InfiniBand
- 00:10:31environment and InfiniBand technology
- 00:10:33you have specific integrated circuits
- 00:10:38for that ASICs or chips if you like and
- 00:10:41in our case Mellanox is providing and
- 00:10:44producing and planning those chips which
- 00:10:47are part of the channel adapters and
- 00:10:49parts of the switches as we will see we
- 00:10:53have the adapter cards that in
- 00:10:55InfiniBand are called the host channel
- 00:10:58adapters of different types we have
- 00:11:00different types because they provide
- 00:11:04capabilities for different users and
- 00:11:07customers according to to the need and
- 00:11:10of course we have different type of
- 00:11:12switches and the the most de or the main
- 00:11:16reason for the difference between the
- 00:11:18switches will be the capacity capacity
- 00:11:20of ports capacity of servers that each
- 00:11:24one of the switches may actually support
- 00:11:26so we are talking about the switches
- 00:11:28options or InfiniBand switches that are
- 00:11:30going between range of 12 ports up to
- 00:11:35648 ports for a specific switch and then
- 00:11:40of course in order to to allow us to
- 00:11:44enhance performance of customers to
- 00:11:49enhance performance of applications we
- 00:11:52also have a different type of software
- 00:11:55solutions begins with management
- 00:11:57solution like the UFM Unified Fabric
- 00:12:00Manager we are talking about the FCA
- 00:12:03which is now part of the standard
- 00:12:06actually part of Red Hat the FCA
- 00:12:10accelerator or Fabric Collective
- 00:12:13Accelerator we have VSA for storage
- 00:12:17virtual storage accelerator and other
- 00:12:20options that will help you actually as
- 00:12:24part of your solution that you provide
- 00:12:27to your customers to provide better
- 00:12:30latency better performance better usage
- 00:12:34of the CPU and hardware resources of
- 00:12:40course part of the part of the network
- 00:12:43usually will be the media the media
- 00:12:47between the servers to the switches the
- 00:12:49media of the cables the cables are of
- 00:12:51two main categories the copper cables
- 00:12:54and the fiber cables this is the first
- 00:12:56category beyond that category of copper
- 00:13:00cables and fiber cables we have the
- 00:13:02other category which is which talks
- 00:13:05about is it a passive type of cable or
- 00:13:08active type of cable just to make it
- 00:13:13short and simple usually when we are
- 00:13:15talking about the difference between
- 00:13:18passive cable and active cable active
- 00:13:21cable will allow you to have a longer
- 00:13:23distance between two ends or two nodes
- 00:13:26in the fabric so when we talk about the
- 00:13:32targets targets of Mellanox or the targets
- 00:13:35of the InfiniBand technology today we
- 00:13:38are talking about different type of
- 00:13:40applications like financial services
- 00:13:42cloud database enterprises Web 2.0 and
- 00:13:46of course the most important is HPC it
- 00:13:50would be different if I'm going to talk
- 00:13:52in the cloud environment but for this
- 00:13:54sessions the HPC this is the most
- 00:13:57environment a most important one so we
- 00:14:01are talking about actually advantage and
- 00:14:05better performance enhanced performance
- 00:14:07to the HPC environment when you are
- 00:14:10working with
- 00:14:12InfiniBand specifically and of course
- 00:14:15with some of our Mellanox solutions that
- 00:14:19will allow you to provide ten times
- 00:14:23better performance higher usage or
- 00:14:27better usage of your cpu resources
- 00:14:30better latency and of course very high
- 00:14:34bandwidth from the net for the server
- 00:14:38side to the switches side in order to
- 00:14:42provide it of course we need different
- 00:14:45components in the network to be there and
- 00:14:49this is actually the job of the
- 00:14:52technology the job of the fabric that
- 00:14:55will provide the service for you so what
- 00:15:00we are going to see today is what are
- 00:15:03the basics again of the technology what
- 00:15:06are the main components what is their
- 00:15:08job and how could we make it better as I
- 00:15:13was saying before in order to provide
- 00:15:17the main components in the network
- 00:15:20Mellanox
- 00:15:22has the technology for the channel
- 00:15:25adapter which is today in a generation
- 00:15:27which is called ConnectX-3
- 00:15:31and in the switches environment we are
- 00:15:34talking about SwitchX-2 both of those
- 00:15:38guys are able to provide today FDR 56
- 00:15:43gigabit per second per port of the
- 00:15:47channel adapter or per port of the switch
- 00:15:54about the ConnectX-3 as you can see
- 00:15:57ConnectX-3 will allow you to connect on
- 00:16:00the bus side on the server side to
- 00:16:03connect with PCI generation number 3
- 00:16:08PCIe 3 will be supported by ConnectX-3
- 00:16:13channel adapter cards any question in
- 00:16:16this stage no okay so when we talk about
- 00:16:24InfiniBand InfiniBand generation and
- 00:16:27InfiniBand planning what is the plan
- 00:16:30ahead what is the history of the
- 00:16:32InfiniBand so InfiniBand is the network
- 00:16:35InfiniBand
- 00:16:36as a pipe provider was planned in around
- 00:16:412000 and in around 2000 the port
- 00:16:46capability in the InfiniBand environment
- 00:16:48it was huge for that specific period we
- 00:16:54could have provided 10 gigabit per
- 00:16:57second per port and the year was 2001 in
- 00:17:042005 we came to the second generation
- 00:17:07that was called DDR double data rate
- 00:17:11with the capability of 20 gigabit per
- 00:17:14second per port the next generation that
- 00:17:17I suppose part of you're still working
- 00:17:21with was QDR QDR is able to
- 00:17:26provide 40 gigabit per second per port
- 00:17:29can you tell me who in the audience has
- 00:17:32in his cluster QDR QDR so 70
- 00:17:39percent is actually still working with
- 00:17:41QDR so
- 00:17:43can you also tell me what is the
- 00:17:46capability in data that every QDR port
- 00:17:50can provide can support how much data
- 00:17:56can you forward on every QDR port you
- 00:18:04are pointing that and this is you're
- 00:18:06saying 40 gigabit what is your name yep
- 00:18:12get really saying 40 gigabit do you
- 00:18:15agree well he is saying 32 okay so uh well
- 00:18:25QDR
- 00:18:26provides 40 gigabit per second of signal
- 00:18:30you may say so the signal the bits if
- 00:18:33you will count them you will see 40
- 00:18:34gigabit per second but actually we will
- 00:18:38see that in the QDR generation we have
- 00:18:40encoding which is called 8b/10b it
- 00:18:43actually takes 20% of the line rate if
- 00:18:47you like and provide it for for the
- 00:18:50overhead encoding overhead so this is
- 00:18:54the QDR generation part of you also have
- 00:18:58ports or channel adapters and switches of
- 00:19:02the FDR generations who has FDR today so
- 00:19:07we're coming to 10%
- 00:19:10FDR 56 gigabit per second and this is
- 00:19:14today and if we are going to talk about
- 00:19:18the end of this year I don't want to
- 00:19:20provide you specific dates then we would
- 00:19:24be able to provide also 100 gigabit per
- 00:19:28second per port and this is the new
- 00:19:33generation or the current generation at
- 00:19:35the end of 2014 which will be the EDR
- 00:19:38generation enhanced data rate 100 gigabit
- 00:19:43per second per port of your server 100
- 00:19:46gigabit per second per port on the
- 00:19:49switch
- 00:19:51of course the other a element the other
- 00:19:54parameter which is very important
- 00:19:56specifically for high performance
- 00:19:58compute will be the latency how much
- 00:20:02time does it take for a packet for
- 00:20:05calculation to go between one server to
- 00:20:09the other server and this is actually
- 00:20:12the latency the latency is the latency
- 00:20:15when I'm talking here about the latency
- 00:20:16what is the latency that I am as a
- 00:20:19switch as InfiniBand switch I'm going to
- 00:20:22add to the packet that comes into my
- 00:20:26switch and then goes out I'm not going
- 00:20:29to talk about the latency that is
- 00:20:31actually happens if you like in the
- 00:20:33server it happens in the process between
- 00:20:37the application upper layers to the
- 00:20:39lower layers I'm talking about the
- 00:20:42latency that is added by the switch the
- 00:20:46latency that is added by the switch
- 00:20:48specifically we will talk about it it
- 00:20:50may be again later when a packet comes
- 00:20:54into InfiniBand switch of Mellanox in
- 00:20:58our example until it goes out every
- 00:21:01switch or every hop of a switch will
- 00:21:06add 170 nanoseconds of course you can
- 00:21:12tell me all that but usually we are
- 00:21:14going when a packet enters to the fabric
- 00:21:17many times it is actually passing more
- 00:21:21than one switch more than one hop it may
- 00:21:25go to the first switch and then to the
- 00:21:28second switch and then to the third
- 00:21:30switch so how do you count that of
- 00:21:32course I'm going to add the delay so if
- 00:21:37i have delay 170 nanoseconds in one switch
- 00:21:41and then another one and then another
- 00:21:43one
- 00:21:43then i'm going to to count for five to
- 00:21:47seven hundred nanoseconds of delay which
- 00:21:49is still very very short and good delay
- 00:21:53and this is in the current generation of
- 00:21:56FDR
- 00:21:59but what was the beginning of the
- 00:22:02InfiniBand the beginning of the
- 00:22:04InfiniBand 1999 there is organization
- 00:22:08which is called was set up organization
- 00:22:12that was called the IBTA InfiniBand
- 00:22:15Trade Association founded in 1999 and the
- 00:22:19target of the IBTA was to provide the
- 00:22:22network and the protocol for data so
- 00:22:27this is not something which is new we
- 00:22:31already had networks like X.25 and
- 00:22:35ISDN and Ethernet and Fibre Channel and
- 00:22:39frame relay
- 00:22:41so all those network are actually also
- 00:22:45providing data and networks so what is
- 00:22:49the difference what is the main
- 00:22:52difference or what is one of the main
- 00:22:54differences between the data network of
- 00:22:57InfiniBand to the other data network so
- 00:23:00first it was created it was thought
- 00:23:02about in 2000
- 00:23:04so in 2000 they were thinking already
- 00:23:07about customers of the 2000s
- 00:23:10applications of the 2000s performance capacity
- 00:23:15latency applications of HPC that cannot
- 00:23:21work with the other protocols because
- 00:23:24they are too slow because they are too
- 00:23:27narrow and because also because of
- 00:23:32another a very important factor one of
- 00:23:36the things that was not thought about in
- 00:23:38the other network like Ethernet
- 00:23:41switching they were not thinking about
- 00:23:45what happens to a packet of data to a
- 00:23:49message of data in the server itself
- 00:23:54when you are taking a packet of data in
- 00:23:58the tcp/ip environment that packet of
- 00:24:01data is taken and then it is put into
- 00:24:05frames in the higher layers and then it
- 00:24:09goes in the server to the kernel
- 00:24:12environment to the tcp/ip kernel
- 00:24:15environment and what happens there that
- 00:24:19the kernel environment has to process
- 00:24:21each one of those packets each one of
- 00:24:26the data packets has to be processed in
- 00:24:28the kernel environment and then to be
- 00:24:30sent to be copied to another buffer and
- 00:24:33then from the buffer to be sent to the
- 00:24:35channel adapter and in the channel
- 00:24:38adapter to send it out that is normally
- 00:24:42what happens in the other type of
- 00:24:44protocols one of the things that was
- 00:24:47thought about in the InfiniBand was how
- 00:24:52to reduce the time that the packet will
- 00:24:56take from the application layer until it
- 00:25:01goes out from the server to the other
- 00:25:04side that time has to be shortened that
- 00:25:07time has to be cut and this is something
- 00:25:11which is very unique to the InfiniBand
- 00:25:15protocol and InfiniBand environment how
- 00:25:18do you make it more efficient how do you
- 00:25:22make it more short how do you make the
- 00:25:26latency better within this distance
- 00:25:29between the higher layer protocols to
- 00:25:33the channel adapter before it goes out
- 00:25:35to the other side so this is also
- 00:25:37something that was a part of the
- 00:25:40thoughts of the guys that were behind
- 00:25:43the creation of the InfiniBand behind the
- 00:25:48planning of the InfiniBand like IBM Mellanox
- 00:25:51Oracle HP Cray Intel and others
- 00:25:54that probably i did not count here so
- 00:25:59InfiniBand is actually a switched
- 00:26:02fabric architecture
- 00:26:04interconnect technology connecting
- 00:26:06CPUs and I/O which provides super high
- 00:26:12performance which provides high
- 00:26:14bandwidth okay starting at 10 gigabit
- 00:26:18per second up to 100
- 00:26:22gigabit per second per
- 00:26:24port per port of a server per port of the
- 00:26:28switch the low-latency that is required
- 00:26:32by your customers should be challenged
- 00:26:35by the protocol and should be responded
- 00:26:39by the provided protocol and as I was
- 00:26:41telling you the minimal latency that
- 00:26:45today's switch will add is 170 nano
- 00:26:50second and the other challenge that I
- 00:26:54was talking about how do we cut how do
- 00:26:59we enhance the performance of the
- 00:27:01application and the way that it sends
- 00:27:03the information from the higher layer
- 00:27:05protocols until it goes to the channel
- 00:27:08adapter and the main mechanism which is
- 00:27:12used for that and I suppose all of you
- 00:27:16probably familiar with is using the
- 00:27:21process using the mechanism which is
- 00:27:24called RDMA remote direct memory
- 00:27:30access which will provide a better
- 00:27:34latency for every packet that goes from
- 00:27:36the upper layer protocol up to the
- 00:27:40channel adapter and will also allow us
- 00:27:43to have a lower CPU or better CPU
- 00:27:49utilization better CPU utilization
- 00:27:51because most of the traffic that was
- 00:27:55processed until the usage of
- 00:27:58RDMA that was processed in the kernel now
- 00:28:03should not be processed in the kernel
- 00:28:06this is one of the advantages of
- 00:28:10RDMA so traffic communication now in the
- 00:28:15technology that we are planning the
- 00:28:18traffic communication will bypass the
- 00:28:20operating system will bypass the kernel
- 00:28:24and it means that the CPU could be used
- 00:28:29for many more processes which are
- 00:28:33other than
- 00:28:35dealing with the traffic itself
- 00:28:42InfiniBand was originally designed for
- 00:28:44large scale grids and clusters it came to
- 00:28:49increase the application performance it
- 00:28:51can provide solutions for local area
- 00:28:54network for an for a bigger network for
- 00:28:58storage area networks and application
- 00:29:00communications it will provide high
- 00:29:04reliability cluster management what does
- 00:29:08it mean high reliability cluster
- 00:29:10management well this is a bit vague I
- 00:29:14must say but the high reliability of
- 00:29:17course means that if you have something
- 00:29:19or an entity that will manage the
- 00:29:22network and that entity fails down
- 00:29:25automatically you will have a backup to
- 00:29:28that entity that will take its place we
- 00:29:32are talking about the first Network
- 00:29:34which is an SDN network a real SDN network
- 00:29:40which means software-defined network
- 00:29:43software managed Network which means
- 00:29:47that we don't need someone we don't need
- 00:29:50an administrator that will configure all
- 00:29:54parameters all the parameters in your
- 00:29:59network all the parameters up to a level
- 00:30:01of a port can be defined today by the
- 00:30:05entity that manages the network by
- 00:30:08software and that entity that manages
- 00:30:11the InfiniBand network by software is
- 00:30:14called the subnet manager we're going of
- 00:30:18course to talk about it before we
- 00:30:21continue
- 00:30:21since RDMA is one of the main
- 00:30:25processes or one of the main solutions
- 00:30:27here
- 00:30:28what is the advantage of RDMA? RDMA
- 00:30:31again, for those... well, somebody here who does
- 00:30:34not know what RDMA is, please
- 00:30:37raise your hand okay
- 00:30:42so don't laugh if you have that question
- 00:30:45it seems funny there are people in the
- 00:30:48world which are not familiar with
- 00:30:50RDMA okay so since everybody knows what
- 00:30:55RDMA is and I suppose that you know
- 00:30:57that without RDMA we have the
- 00:31:00application message that goes on the
- 00:31:03buffers on the user layer and from the
- 00:31:06buffers in the user layer they have to
- 00:31:08be copied every packet has to be copied
- 00:31:12to the buffers in the operating system
- 00:31:14level and from the operating system
- 00:31:17level after it is processed in the
- 00:31:20TCP/IP layers it has to be copied again
- 00:31:24to the channel adapter layer the channel
- 00:31:28adapter or the NIC as it is
- 00:31:31called here or as it is called in the
- 00:31:33InfiniBand environment host channel
- 00:31:37adapter so every packet has to go: user
- 00:31:40copy, kernel copy, kernel process,
- 00:31:43kernel copy to the channel adapter and
- 00:31:46then goes to the other side this is
- 00:31:50what happens without RDMA; with RDMA
- 00:31:53we have zero copy we have zero copy
- 00:31:57we don't have to copy the packets which
- 00:32:00are the data packets we don't have to
- 00:32:02copy them to the kernel we don't have to
- 00:32:05copy them to the kernel so we saved it
- 00:32:07we save the time we save the copy
- 00:32:10process and we just take those packets
- 00:32:14directly from the user layer of one
- 00:32:18server we send it directly to the user
- 00:32:22layer of the other server and this is
- 00:32:25the RDMA over InfiniBand
- 00:32:27that saves time, saves latency, improves
- 00:32:32the latency and of course allows us to
- 00:32:35do much more work in the operating
- 00:32:38system this is actually... we can probably
- 00:32:41gain between 60 and 80 percent of
- 00:32:46improvement, or efficiency if you like,
- 00:32:49in the CPU
- 00:32:53because we don't need to process the
- 00:32:55traffic in the operating system in the
- 00:32:59kernel so let's look at the InfiniBand
- 00:33:03architecture InfiniBand architecture
- 00:33:06this is the basic structure like most of
- 00:33:11the other data networks that you know do
- 00:33:14you have the laser pointer working
- 00:33:16in your... yeah probably
- 00:33:30okay data networks look
- 00:33:37actually very similar if you take
- 00:33:39Ethernet, Frame Relay, ISDN, X.25, whatever
- 00:33:44data network it will look like that
- 00:33:47right so we have to talk about the
- 00:33:51network itself and what are the parts
- 00:33:53which are special in those network so
- 00:33:56first let's talk about our customers our
- 00:33:59customers are your customers as
- 00:34:02well those are the HPC computing systems
- 00:34:05or computing servers and the HPC storage
- 00:34:09servers so those are my customers here I
- 00:34:12have the application let's say that one
- 00:34:15of the most common applications in your
- 00:34:17environment will be the Open MPI protocol,
- 00:34:21will be GPFS, NVIDIA or some other type
- 00:34:27of applications which are working here
- 00:34:30on your clients then we are taking that
- 00:34:33application and that application
- 00:34:35calculation result has to go to the other
- 00:34:39computation server in order to move
- 00:34:42the traffic to the other side we have to
- 00:34:45take that packet and then we have to go
- 00:34:49out to the other server how do we go to
- 00:34:51the other server for that purpose
- 00:34:55we need switches the switches are doing the
- 00:34:57switching so we are taking packets from
- 00:35:00server number X to server number Y later
- 00:35:04we are going to talk about addresses and
- 00:35:06addressing so we are going to take the
- 00:35:09packet and send it via the network card
- 00:35:12of the InfiniBand environment the
- 00:35:15network card of the InfiniBand
- 00:35:17environment is called the HCA the host
- 00:35:20channel adapter the host channel
- 00:35:23adapter has actually two sides one side
- 00:35:26is the side of the bus of the first
- 00:35:29server which is the PCI bus and the
- 00:35:32other side is the InfiniBand link the
- 00:35:35InfiniBand link could go with the
- 00:35:38relevant media to the switch what is the
- 00:35:42relevant
- 00:35:42media the relevant media could be copper
- 00:35:45if you would like to have a longer
- 00:35:47distance instead of copper you are taking
- 00:35:50fiber right okay so you took the
- 00:35:54InfiniBand link up to the switch that
- 00:35:57switch may have different types of
- 00:35:59ports in terms of capacity it could be a
- 00:36:02switch that takes 12 ports it could be a
- 00:36:05switch that takes 648 ports in order to
- 00:36:10move the traffic between that server,
- 00:36:12server A, to that server, server B, I have to go
- 00:36:16to at least one switch and sometimes to
- 00:36:20several switches so therefore we are
- 00:36:22going into the network between
- 00:36:25one switch to another switch and then we
- 00:36:29send information to the end destination
- 00:36:33the data network or the fabric the
- 00:36:38InfiniBand fabric here is working today
- 00:36:43in a subnet or one subnet only the data
- 00:36:49network today is working as a layer 2
- 00:36:54data network what is the difference
- 00:36:58between layer 2 data network to layer 3
- 00:37:01data network
- 00:37:08now this is a question that I know that
- 00:37:10maybe some of you do not know this is
- 00:37:13why I'm asking unlike RDMA yes IP
- 00:37:20packet okay well of course you're right
- 00:37:29about in part of the answer somebody
- 00:37:33would like to expand No
- 00:37:36okay so when I'm talking about one
- 00:37:39subnet about switching of layer 2
- 00:37:42switching layer 2 means that all of my
- 00:37:46nodes so the difference is wrong ok so
- 00:37:51all of my nodes belong to the same
- 00:37:53subnet right? now, IP is a term
- 00:38:00from another environment ok because in
- 00:38:05the InfiniBand we have different types
- 00:38:07of addressing okay so one subnet means
- 00:38:11that all my nodes belong to the same
- 00:38:14subnet to the same subnet therefore if
- 00:38:17you have 20 nodes or two thousand nodes
- 00:38:21all of them are part of the same subnet
- 00:38:25and the switches currently here are
- 00:38:29switches of layer 2 so it doesn't matter
- 00:38:34from which organization here you are
- 00:38:36coming today you're working with a layer 2
- 00:38:41network in the InfiniBand environment so
- 00:38:45those are our switches about
- 00:38:48addressing we're going to talk in the next
- 00:38:50stage so in most of your environments
- 00:38:55you have the InfiniBand environment for
- 00:38:58the HPC, for calculation, for GPFS or
- 00:39:02NVIDIA, for storage, for computing and
- 00:39:04storage and most of it is InfiniBand in
- 00:39:08some of your networks you have a kind of
- 00:39:10mixed environment and mixed
- 00:39:14environment means
- 00:39:16that maybe two years from today or
- 00:39:21two years back
- 00:39:23you had another network that is maybe
- 00:39:25Ethernet network or fiber channel
- 00:39:29network and sometimes you need to move
- 00:39:33information between the part which is
- 00:39:36the InfiniBand part
- 00:39:38InfiniBand environment to the part which
- 00:39:41is the ethernet environment so you have
- 00:39:44two computers one computer one node is
- 00:39:48in the ethernet environment and one node
- 00:39:51is in the InfiniBand environment how do
- 00:39:55you move traffic between two different
- 00:39:57environments the answer is how using
- 00:40:05routing no the answer is thank you by
- 00:40:12the way for the response the answer here
- 00:40:15is a gateway it's not very far of course
- 00:40:20from the answer the answer is a gateway
- 00:40:23what the Gateway is a gateway is
- 00:40:25actually if you like a converter
- 00:40:28a converter between two protocols a
- 00:40:31converter between two languages so what
- 00:40:35I would like to do here in my example I
- 00:40:37would like to take a packet that is on a
- 00:40:40server here in that environment to a
- 00:40:43server in that environment
- 00:40:46because here they talk InfiniBand-ish
- 00:40:51you know that language and here they
- 00:40:55talk Ethernet we have actually to take
- 00:40:58packets from IB to Ethernet and from
- 00:41:02Ethernet to IB for that purpose we
- 00:41:05have the gateway this is the job of the
- 00:41:08gateway to enable us to talk between
- 00:41:11Ethernet and InfiniBand or Fibre Channel and
- 00:41:14InfiniBand the component is called
- 00:41:16gateway good clear
- 00:41:22usually the information or usually the
- 00:41:25type of packets of course that will go
- 00:41:27between the ethernet environment to the
- 00:41:30InfiniBand environment will be IP
- 00:41:33packets we are using in IB, we are
- 00:41:38going to talk about it, we are using a
- 00:41:41function which is called IP over IB, IP
- 00:41:48over IB, that will be the way we are
- 00:41:52going to take IP packets here
- 00:41:54and send them as IP packets to the
- 00:41:57Ethernet environment IP over IB now
- 00:42:01let's talk about the management entity
- 00:42:04one of the things that will make the
- 00:42:07InfiniBand environment special and
- 00:42:09unique is the capability of the
- 00:42:12InfiniBand to be SDN software-defined
- 00:42:16network and software-defined network...
- 00:42:19somebody here has managed an Ethernet
- 00:42:21environment working with switches of
- 00:42:27Ethernet whoo you're saying you're
- 00:42:32nodding like Gabriel so if you would
- 00:42:35like for example to change a parameter
- 00:42:40on a VLAN or trunk or speed what do
- 00:42:44you have to do you have to do it
- 00:42:45yourself so it's kind of do-it-yourself
- 00:42:47network but you have to do it yourself
- 00:42:50so you need an administrator in Ethernet
- 00:42:54environment in InfiniBand environment
- 00:42:57actually as I was saying it is actually
- 00:42:59the first network type that is SDN
- 00:43:03software-defined network so who is that
- 00:43:07entity the entity in the InfiniBand
- 00:43:10protocol that entity is called subnet
- 00:43:14manager, SM the subnet manager is the
- 00:43:18most important entity in the network
- 00:43:21because it does all the configuration up
- 00:43:28to a level port of each one of the
- 00:43:31switches of each one of the cell
- 00:43:35every parameter is managed by the subnet
- 00:43:39manager this is the configuration you
- 00:43:44don't need any administrator to do any
- 00:43:47changes here the subnet manager will do
- 00:43:49it all it does work we have we have an
- 00:44:05unsatisfied customer
- 00:44:08so the way to do that today in marketing
- 00:44:13they are saying you're right Gabriel
- 00:44:14what is the problem we're going to solve
- 00:44:17it it's not that I'm dissatisfied is
- 00:44:20just it's not just switching ok
- 00:44:25first let's talk about it ok but we will
- 00:44:29do it in one of the next stages ok so
- 00:44:36this is the subnet manager that
- 00:44:38according to the theory most of
- 00:44:41the time should work and the subnet
- 00:44:44manager is doing everything what happens
- 00:44:49if that subnet manager fails if the
- 00:44:56subnet manager fails let's call it the
- 00:44:58active subnet manager if the
- 00:45:01active subnet manager fails of course we
- 00:45:03need another subnet manager in the
- 00:45:07network the other subnet manager that
- 00:45:10will be somewhere here will be the
- 00:45:12standby or the fallback subnet manager
- 00:45:17so we have in every network we must have
- 00:45:20a fallback and an active subnet manager
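The active and standby arrangement described here can be sketched as a tiny model. This is only an illustration: the class and function names are hypothetical, the GUID values are made up, and the election rule (highest priority wins, lowest GUID breaks a tie) is an assumption about how subnet managers commonly pick a master, not something stated in the talk.

```python
from dataclasses import dataclass


@dataclass
class SubnetManager:
    """Hypothetical stand-in for one subnet manager instance."""
    name: str
    guid: int
    priority: int
    alive: bool = True


def elect_master(managers):
    """Return the SM that should be active, or None if none is alive.

    Assumed rule: highest priority first, lowest GUID as tie-breaker.
    """
    candidates = [sm for sm in managers if sm.alive]
    if not candidates:
        return None
    return min(candidates, key=lambda sm: (-sm.priority, sm.guid))


# Two managers, as the speaker recommends: one active, one standby.
sms = [
    SubnetManager("sm-a", guid=0x0002C90300000001, priority=10),
    SubnetManager("sm-b", guid=0x0002C90300000002, priority=5),
]
print("active:", elect_master(sms).name)

# Simulate a failure of the active SM; the standby takes over.
sms[0].alive = False
print("active after failure:", elect_master(sms).name)
```

With a single manager, a failure would leave `elect_master` returning `None`, which is the situation the standby is meant to prevent.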
- 00:45:24by the way the protocol itself or the
- 00:45:28standard itself of the InfiniBand allows
- 00:45:31you to have more than two subnet
- 00:45:33managers we suggest to have only two
- 00:45:36subnet managers in the network so we
- 00:45:41were talking currently about a
- 00:45:43layer 2 network although it will support
- 00:45:46layer 3 in the future
- 00:45:48the main components are the customers
- 00:45:51themselves the host channel adapter that
- 00:45:54allows you to do the network
- 00:45:55connection
- 00:45:56the InfiniBand link itself that goes
- 00:45:59from the server to the switch the
- 00:46:01switches that currently are doing layer
- 00:46:032 switching the nodes belong all of them
- 00:46:07to the same network or to the same
- 00:46:09subnet and in order to move packets
- 00:46:13between InfiniBand environment to other
- 00:46:16type of protocol environment we have a
- 00:46:19functionality which is called gateway
- 00:46:22questions ok so we talked about the
- 00:46:29different sorry the different elements
- 00:46:34and first the host channel adapter the
- 00:46:40host channel adapter is the device that
- 00:46:43terminates the InfiniBand link here we
- 00:46:46have the connector this
- 00:46:48host channel adapter looks like, I suppose,
- 00:46:52other channel adapters that you have
- 00:46:54seen up to today but the connector is
- 00:46:57different the connector today in
- 00:46:59InfiniBand is called the QSFP connector for
- 00:47:05those of you, I suppose a
- 00:47:08small number, who are not familiar with
- 00:47:10that, QSFP is a different connector
- 00:47:13it's not SFP it's not SFP+ it's
- 00:47:17not RJ45 it is QSFP so we have a QSFP
- 00:47:23connector that goes out towards
- 00:47:27the switch this is the channel adapter
- 00:47:28of course it's part of the server and
- 00:47:31this is the side that goes to the bus of the
- 00:47:34server the PCIe bus the PCIe generation
- 00:47:38today that we are talking about is PCIe
- 00:47:41generation 3 then we have the switches
- 00:47:45the switch is a device that moves
- 00:47:47packets from one link to another or from
- 00:47:50one node to another on the same
- 00:47:53InfiniBand subnet on the same InfiniBand
- 00:47:56subnet this is why we call it layer 2
- 00:48:01switching okay then we have an element
- 00:48:06which is called the router and now you
- 00:48:10can ask me, Ed, but you told us that
- 00:48:12we are working in a layer 2 network so why
- 00:48:17do you need a router the answer is that
- 00:48:24currently practically we are not working
- 00:48:27with routers currently we have the router in
- 00:48:31the protocol we have the router in the
- 00:48:34standard we have layer three for routing
- 00:48:37of InfiniBand we have addresses for
- 00:48:41routing of InfiniBand routing will allow
- 00:48:44us to move traffic from one subnet to
- 00:48:49another subnet so we have it in the
- 00:48:53standard we have it in the theory
- 00:48:56currently it's not yet implemented
- 00:49:01almost 100% and then we have the gateway
- 00:49:05or a bridge as it is called and the
- 00:49:08gateway's job will be to enable us
- 00:49:10to move packets between the InfiniBand
- 00:49:15environment and the Ethernet environment or the
- 00:49:19Fibre Channel environment we talked
- 00:49:23about the channel adapter and we said
- 00:49:26that the channel adapter is connected
- 00:49:28with a link to the switch the link could
- 00:49:32work with different types of speeds it
- 00:49:36could be SDR or DDR
- 00:49:39or QDR or FDR at the end of the year it
- 00:49:43will be also EDR this is the way if
- 00:49:47you like so if we are going back to 2001
- 00:49:51the only capability was SDR single data
- 00:49:56rate then we went to DDR double
- 00:50:01data rate and then we have QDR
- 00:50:05quadruple data rate and then we go
- 00:50:08into FDR fourteen data rate this is the current
- 00:50:11generation and at the end of the
- 00:50:14year God willing we're going to have
- 00:50:17EDR enhanced data rate which is I think
- 00:50:20actually, what, 25... not 25
- 00:50:28sorry it's a mistake well please edit it
- 00:50:36in the deck yeah well ah actually it was
- 00:50:43put there intentionally to check if you
- 00:50:46are really checking what is said here
- 00:50:50right
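The generations listed above are easier to compare as a small table. The sketch below assumes the commonly quoted nominal per-lane signalling rates and a four-lane (4x) link; only the FDR "fourteen" figure and the 56 and 100 Gb/s link numbers come from this presentation, while the SDR, DDR and QDR values and all the names in the code are assumptions added for illustration.

```python
# Nominal per-lane signalling rates in Gb/s (rounded, assumed values).
NOMINAL_LANE_GBPS = {
    "SDR": 2.5,   # single data rate
    "DDR": 5.0,   # double data rate
    "QDR": 10.0,  # quad data rate
    "FDR": 14.0,  # fourteen data rate
    "EDR": 25.0,  # enhanced data rate
}


def link_rate_gbps(generation: str, lanes: int = 4) -> float:
    """Nominal link rate for a generation, assuming `lanes` lanes per link."""
    return NOMINAL_LANE_GBPS[generation.upper()] * lanes


for gen in NOMINAL_LANE_GBPS:
    print(f"{gen}: {link_rate_gbps(gen):g} Gb/s on a 4x link")
```

These are raw signalling figures; the usable data rate is somewhat lower once line encoding overhead is taken into account.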
- 00:50:51okay so that was easier let's now go to
- 00:50:54the addressing when we talk about the
- 00:50:58addresses of protocols first we have the
- 00:51:01physical address each one of us has a
- 00:51:04physical address kind of genes I suppose
- 00:51:08or whatever the physical address in the
- 00:51:12environment of InfiniBand is called
- 00:51:15GUID GUID is the global unique
- 00:51:20identifier of that specific node so all
- 00:51:29the elements in the environment are
- 00:51:31nodes each node will have its own GUID
- 00:51:34who is a node a channel adapter of a
- 00:51:38server is a node a switch is also a node
- 00:51:45okay a gateway is also a node so each
- 00:51:49one of them has a physical address
- 00:51:53which is a bit like a MAC if you're not
- 00:51:57familiar with GUID yet then you can
- 00:51:59compare it to a MAC so this is my
- 00:52:03physical address, the global unique identifier
- 00:52:09the global unique identifier is an
- 00:52:12address which is comprised of 64 bits
- 00:52:16part of those bits of course will be the
- 00:52:19vendor ID ok 64 bits
- 00:52:24it is assigned by the InfiniBand vendor
- 00:52:28and it is persistent through reboot so
- 00:52:31if you're taking that channel adapter
- 00:52:32with its port, if the server goes down
- 00:52:36and up, is the GUID going to change? yes?
- 00:52:43is the GUID going to change?
- 00:52:46no okay that was right so it is
- 00:52:52persistent it is a fixed one the GUID
- 00:52:55is a fixed parameter a fixed identification
- 00:52:58of the node now this is for channel
- 00:53:04adapter so now here I would like to
- 00:53:06emphasize that point every port in a
- 00:53:11channel adapter has its own GUID do
- 00:53:15some of you have more than one port on
- 00:53:17the same server it's like two ports on
- 00:53:21the same server yes
- 00:53:23how many GUIDs do you have for two ports
- 00:53:28one for each port right so this is
- 00:53:31obvious so now who is working with a switch
- 00:53:35with 36 ports your name oh sorry Chris
- 00:53:47Chris how many GUIDs do you have for
- 00:53:49that switch with 36 ports one GUID
- 00:53:54right so in a switch you do not have a
- 00:53:58specific GUID for every port in the switch
- 00:54:02you have only one GUID or at least one
- 00:54:05main GUID let's call it the node GUID of
- 00:54:08the switch so the switch has the node
- 00:54:12GUID so how will we identify a specific
- 00:54:16port in that switch if you have thirty-six
- 00:54:22ports how will you identify the port
- 00:54:27the port number exactly so each
- 00:54:31port has a port number so if you like to
- 00:54:35identify a specific port you actually
- 00:54:37take the GUID of the switch plus the
- 00:54:41port number right
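The rule just described, a 64-bit GUID for the switch plus a port number for each port, can be written down as a small data structure. This is a sketch under stated assumptions: the class and function names are hypothetical, the GUID value is made up, and treating the top 24 bits as the vendor part follows the usual EUI-64 convention rather than anything said in the talk.

```python
from dataclasses import dataclass

GUID_BITS = 64
VENDOR_BITS = 24  # assumption: EUI-64 style vendor prefix


def format_guid(guid: int) -> str:
    """Render a GUID as a fixed-width 64-bit hex string."""
    if not 0 <= guid < 2 ** GUID_BITS:
        raise ValueError("a GUID must fit in 64 bits")
    return f"0x{guid:016x}"


def vendor_prefix(guid: int) -> int:
    """Upper bits of the GUID, assumed here to identify the vendor."""
    return guid >> (GUID_BITS - VENDOR_BITS)


@dataclass(frozen=True)
class SwitchPort:
    """A switch port named by the switch's node GUID plus a port number."""
    node_guid: int
    port_number: int

    def __str__(self) -> str:
        return f"{format_guid(self.node_guid)}/P{self.port_number}"


switch_guid = 0x0002C90300A1B2C3  # made-up example value
print(SwitchPort(switch_guid, 17))
print(hex(vendor_prefix(switch_guid)))
```

A host channel adapter would not need the port number in this model, since the speaker notes that each of its ports carries its own GUID.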
- 00:54:43there is another type of GUID which is
- 00:54:46called the system GUID someone is
- 00:54:49familiar with system GUID who has a
- 00:54:55chassis switch or modular switch which
- 00:55:00is more than 36 ports no one here...
- 00:55:04you have which type of switch a big
- 00:55:10fabric one if I have a big fabric one it
- 00:55:13means that on the same chassis on the
- 00:55:15same physical chassis you have actually
- 00:55:17multiple switches so each one of those
- 00:55:21switches will have its own node GUID but
- 00:55:25the chassis as a whole will have another
- 00:55:28GUID which is called a system GUID so
- 00:55:32the system GUID will be the identifier
- 00:55:34of the whole chassis and each one of
- 00:55:37those switches within the modular switch will
- 00:55:40have a specific GUID okay IB fabric IB
- 00:55:48fabric basic building block when
- 00:55:52you have InfiniBand fabric InfiniBand
- 00:55:55network that InfiniBand network is built
- 00:55:58according to the needs what is
- 00:56:01the basic need usually the basic need
- 00:56:04will be the capacity how many nodes do
- 00:56:07you need in your cluster right of course
- 00:56:12this is only one parameter so when
- 00:56:17we are talking about the fabric basic
- 00:56:21building block each one of those blocks
- 00:56:23is going to be a switch by the way each
- 00:56:28one of the switches is based on an ASIC
- 00:56:32what is the ASIC
- 00:56:35actually the ASIC is the heart of your
- 00:56:37switch the ASIC is a chip
- 00:56:42the ASIC is a switching chip and
- 00:56:47today in this generation every ASIC has 36
- 00:56:52ports why do I say this generation
- 00:56:55because if you're looking at two
- 00:56:57thousand and four or five the ASIC the
- 00:57:00basic ASIC was actually structured with 24
- 00:57:04ports today the ASIC is 36 ports when you
- 00:57:10build a network the network is built
- 00:57:13according again to the number of nodes
- 00:57:15that you have to provide that you have
- 00:57:17to support it could be a network for
- 00:57:21some of you have I suppose a hundred
- 00:57:24nodes some of you might have 600 nodes
- 00:57:29some of our customers have 10,000 nodes
- 00:57:3510,000 and more so of course the number
- 00:57:39of switches that you will have to
- 00:57:41provide will be a different number
- 00:57:43according to the number of nodes that
- 00:57:46you will have to connect okay here let's
- 00:57:54look at that slide everybody can see the
- 00:57:56slide even from the last lines I hope
- 00:58:00okay so every switch here let's take a
- 00:58:04basic switch in our cluster and as you
- 00:58:10can see the nodes are connected the
- 00:58:13nodes are connected to switches the
- 00:58:17switches that are used for the direct
- 00:58:20connection the direct connection to the
- 00:58:23nodes those switches have a name that
- 00:58:27somebody can tell us what is the name of
- 00:58:28those switches
- 00:58:33yes exactly those switches are called
- 00:58:37the leafs
- 00:58:38okay so the nodes, the servers, are
- 00:58:43connected to the leaf switches each one
- 00:58:46of those is a leaf now let's say that in
- 00:58:50that example as you can see every leaf
- 00:58:54is going to be connected with 18 ports
- 00:58:58to 18 nodes so 18 ports of that leaf and
- 00:59:0318 ports of that leaf and 18 ports of
- 00:59:07that leaf and so on so if you are using
- 00:59:1118 ports from the leaf to the nodes
- 00:59:15what do you need the other 18 ports
- 00:59:18for so I have thirty-six ports 18 are
- 00:59:27used for the connection to the nodes
- 00:59:30themselves what do I do with those 18
- 00:59:35ports trunks okay the other 18 ports are
- 00:59:41actually used as connections to the
- 00:59:43other to the next layer the next layer
- 00:59:47in the fabric is called... how do you
- 00:59:52call those switches oh yes
- 00:59:59oh it has another name
- 01:00:04backbone also spine spine switches so we
- 01:00:11have two basic type of switches we have
- 01:00:15the leaf switches which are connected
- 01:00:18directly to the server and we have the
- 01:00:20spine switches and the spine switches
- 01:00:24have one main job the spine switches
- 01:00:29actually will help us to make
- 01:00:32interconnection between servers that do
- 01:00:37not belong to the same... to the same what
- 01:00:44to the same leaf because let's say
- 01:00:47that you have those servers here servers
- 01:00:51that belong to leaf A and here you have
- 01:00:54servers that belong to leaf D how will
- 01:00:57you connect between those servers and
- 01:01:00those servers the answer is you need
- 01:01:03another layer that layer will allow you
- 01:01:06to make interconnection between
- 01:01:11different leafs you would be able to
- 01:01:15move packets from that leaf to that leaf
- 01:01:20via that spine okay so this is the spine
- 01:01:26layer this is the job of the spine layer
- 01:01:29to allow connections for traffic
- 01:01:31that goes between or belongs to different leafs
- 01:01:35now if you paid attention I took 18
- 01:01:40ports to the servers and then I have 18
- 01:01:44ports that are going to the spine
- 01:01:46switches so actually 18 ports towards
- 01:01:50the servers and 18 ports towards the
- 01:01:53spines for interconnection why did I
- 01:02:00take 18 ports here and 18 ports for
- 01:02:03interconnection
- 01:02:08yes at the same time right so actually
- 01:02:17one of the things you would like as HPC
- 01:02:20customers you would like the network to
- 01:02:23be non-blocking network and one of the
- 01:02:32ways to get close to that target is that
- 01:02:36if for example you have 18 nodes or 18
- 01:02:41servers in order to verify that they
- 01:02:46will get the bandwidth that they need
- 01:02:49when they want to send out information
- 01:02:51to other 18 servers you would like
- 01:02:56the bandwidth of the nodes
- 01:03:01to be supported by equal bandwidth in
- 01:03:05the interconnection fabric so if you
- 01:03:10would like to move 100 gigabit per
- 01:03:13second from the nodes you need 100
- 01:03:17gigabit per second in the
- 01:03:19interconnection environment to support
- 01:03:21it so actually if you have a number of
- 01:03:2818 ports of FDR to the nodes you will
- 01:03:34need 18 ports of FDR in the
- 01:03:38interconnection if you have 72 ports of
- 01:03:42FDR to the nodes you will need 72 ports
- 01:03:48of FDR in the interconnection in that
- 01:03:52way you will enhance the chances that
- 01:03:55every time you will have the bandwidth that
- 01:03:59is required for a node to go out
- 01:04:02to another server you will have that
- 01:04:06bandwidth in the interconnection
- 01:04:08environment this is called non-blocking
- 01:04:11or if you would like virtual or semi
- 01:04:14non-blocking
- 01:04:15in the interconnection environment now
- 01:04:20of course it's not the total proof it's
- 01:04:25not 100% but this is one of the rules
- 01:04:30that we use in order to have a kind of
- 01:04:34semi non-blocking environment of
- 01:04:38course we could have taken
- 01:04:41instead of 18 ports here and 18 ports in the
- 01:04:44interconnection environment we could
- 01:04:46have taken for example 27 here and 9
- 01:04:50here ok what happens if you have 27
- 01:04:54ports here and 9 ports that are going to
- 01:04:58the interconnection you have actually a
- 01:05:01different ratio between the
- 01:05:06interconnection ports and the connections
- 01:05:09that are going to the servers and the
- 01:05:12different ratio means that you have a
- 01:05:14higher number of node ports than ports
- 01:05:18of interconnection you will get a
- 01:05:21blocking ratio okay let's go to the next
- 01:05:29one if you have questions please ask me
- 01:05:31yes please why does InfiniBand have
- 01:05:3636 ports and not for example a power of 2, 32 why
- 01:05:41do the switches not have a different
- 01:05:43number why this choice in this case
- 01:05:47of 36 ports okay the
- 01:05:50reason is the reason is that the
- 01:05:54integrated circuits which are produced
- 01:05:58the ASICs
- 01:05:59which are produced the basic ASIC today
- 01:06:03has 36 ports although I can tell you
- 01:06:07that we also have a switch of 12 ports
- 01:06:10for example which is also a possibility
- 01:06:14which is clear because of the hardware
- 01:06:16but why exactly
- 01:06:1736 when the ASIC was designed why 36
- 01:06:20I'm not sure I can tell you I can tell
- 01:06:29you that the previous generation was 24
- 01:06:37you're talking about the flip-flops
- 01:06:40inside and so on I suppose anyway this
- 01:06:43is the structure this is the internal
- 01:06:45structure okay so in that example we
- 01:06:48have actually a fabric that will support
- 01:06:5372 ports or 72 nodes
- 01:06:57why 72 nodes because we have created we
- 01:07:02have created an environment that has
- 01:07:04four leafs each one of the leafs has 36
- 01:07:10ports and among those thirty-six ports we have
- 01:07:1418 ports to the nodes so 18, 18, 18, 18: 72
- 01:07:20ports towards the nodes and then we have
- 01:07:2372 ports which are providing us the
- 01:07:27support in the interconnection
- 01:07:28environment question
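The port arithmetic in this example (four 36-port leafs, 18 ports down, 18 ports up, 72 nodes) and the 27-to-9 blocking alternative mentioned earlier can be checked with a few lines of code. This is a minimal sketch: the function name and the returned keys are hypothetical, and the spine count is only the minimum number of 36-port switches needed to terminate every uplink, not a full cabling plan.

```python
from math import gcd


def leaf_spine_plan(leaves: int, down_ports: int, ports_per_switch: int = 36) -> dict:
    """Size a two-tier fabric from the per-leaf split of node and uplink ports."""
    if not 0 < down_ports < ports_per_switch:
        raise ValueError("down_ports must leave at least one uplink port")
    up_ports = ports_per_switch - down_ports
    uplinks = leaves * up_ports
    divisor = gcd(down_ports, up_ports)
    return {
        "nodes": leaves * down_ports,
        "uplinks": uplinks,
        # Ceiling division: spine switches needed to terminate all uplinks.
        "spines": -(-uplinks // ports_per_switch),
        # Node ports to uplink ports; 1:1 is the non-blocking case.
        "ratio": f"{down_ports // divisor}:{up_ports // divisor}",
    }


print(leaf_spine_plan(leaves=4, down_ports=18))
print(leaf_spine_plan(leaves=4, down_ports=27))
```

The first call reproduces the 72-node, two-spine, 1:1 fabric from the slide; the second shows that a 27 and 9 split connects more nodes per leaf but gives a 3:1 blocking ratio.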
- 01:07:36so we have four chips which
- 01:07:39function as lines or leafs and two chips which
- 01:07:43function as cores or spines okay so
- 01:07:51that was the main structure of the
- 01:07:55fabric let's go to the next stage we
- 01:07:57talked about the physical address that
- 01:08:00we call the GUID let's go to the next
- 01:08:02stage when we are moving traffic between
- 01:08:05one node to another node we need to have
- 01:08:08an address the address which is used is
- 01:08:12the layer 2 address so when I would
- 01:08:14like to talk with Chris or when I
- 01:08:17would like to talk with Gabriel each one
- 01:08:21of them will have a layer 2 address the
- 01:08:24layer 2 address in the InfiniBand
- 01:08:26environment is called LID local
- 01:08:29identifier this is the LID the local
- 01:08:33identifier it has 16 bits 16 bits it
- 01:08:38means that the range that is supported
- 01:08:41could be up to 2 to the power of 16 which is about
- 01:08:45how many addresses about 65,000 right
- 01:08:51yes this is the layer 2 address range
- 01:08:54who is providing the addresses well one
- 01:08:59function this is the subnet manager each
- 01:09:02one of the nodes in this room is going
- 01:09:05to get a LID provided by the subnet
- 01:09:08manager the LID again is called local
- 01:09:14identifier so each one of your servers
- 01:09:17has one or two LIDs it depends on the
- 01:09:23number of the ports that you are
- 01:09:24connected with it is assigned by the
- 01:09:27subnet manager and it is theoretically
- 01:09:31not persistent during reboot
- 01:09:37theoretically why do I say theoretically
- 01:09:40if I will give you the option what would
- 01:09:43you like what do you prefer would you
- 01:09:45like it to be changed
- 01:09:47when we reset the server who said no no
- 01:09:57yeah you wouldn't like you would like
- 01:10:00you will prefer your network you will
- 01:10:04prefer your fabric to stay as
- 01:10:06stable as possible you would like that
- 01:10:09if Chris would like to call Gabriel
- 01:10:12even if Gabriel went down because of
- 01:10:16some issue with well even in this case
- 01:10:24when it comes up we would like it to
- 01:10:27stay with the same LID so of course
- 01:10:29there are parameters that will allow you
- 01:10:31to say that even if the server goes down we
- 01:10:35would prefer that when it comes up we
- 01:10:38would prefer it to come with the same
- 01:10:41layer 2 address with the same LID when
- 01:10:45we talk about the LID addresses we have
- 01:10:48two ranges of addresses because we have
- 01:10:50two types of communication in the
- 01:10:53InfiniBand environment the first type of
- 01:10:56communication and InfiniBand environment
- 01:10:58is the unicast communication when I talk
- 01:11:02with Gabriel only two of us of us this
- 01:11:05is unicast when we talk here in the room
- 01:11:08and everybody heals this is what when I
- 01:11:15send information to a group of listeners
- 01:11:19it is is it unicast what is it it is
- 01:11:25multicast of course somebody will ask
- 01:11:28yeah but it's like a broadcast actually
- 01:11:30but in InfiniBand we do not have
- 01:11:33broadcast InfiniBand we have only
- 01:11:37multicast and the simulation if you
- 01:11:41would like to do like a broadcast you
- 01:11:43will actually create a multicast group
- 01:11:45that will include all the participants
- 01:11:49so in InfiniBand there is no broadcast only
- 01:11:54multicast so if you talk about the
- 01:11:57specific addresses you have a unicast
- 01:12:00address range which
- 01:12:01is between 0x0001 and 0xBFFF and you have a
- 01:12:04multicast range which is between 0xC000 and 0xFFFE
- 01:12:08these are the LID addresses who
- 01:12:14provides the LIDs subnet manager okay
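
The two LID ranges just described can be sketched as a small range check. This is only an illustration, not Subnet Manager code: the function name `classify_lid` is hypothetical, the multicast range is assumed to start at 0xC000 as in the InfiniBand Architecture Specification, and the two values outside both ranges (0x0000 and 0xFFFF) are simply labeled "reserved" here.

```python
# Minimal sketch of the LID address ranges mentioned in the talk.
# classify_lid is a hypothetical helper name, not part of any InfiniBand tool.

UNICAST_RANGE = range(0x0001, 0xBFFF + 1)    # unicast LIDs: 0x0001 to 0xBFFF
MULTICAST_RANGE = range(0xC000, 0xFFFE + 1)  # multicast LIDs: 0xC000 to 0xFFFE


def classify_lid(lid: int) -> str:
    """Return 'unicast', 'multicast' or 'reserved' for a 16-bit LID."""
    if not 0 <= lid <= 0xFFFF:
        raise ValueError(f"LID must fit in 16 bits, got {lid!r}")
    if lid in UNICAST_RANGE:
        return "unicast"
    if lid in MULTICAST_RANGE:
        return "multicast"
    # 0x0000 and 0xFFFF fall outside both ranges.
    return "reserved"


if __name__ == "__main__":
    for lid in (0x0001, 0xBFFF, 0xC000, 0xFFFE, 0xFFFF):
        print(f"0x{lid:04X}: {classify_lid(lid)}")
```

In a real fabric these values are handed out by the subnet manager; the check above only shows how a 16-bit LID maps onto the two address ranges.
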
- 01:12:22the next element that I wanted to show
- 01:12:25you is that so as I was saying we have
- 01:12:29only one subnet in your organization
- 01:12:33most of you one of the questions will be
- 01:12:37and it might come from you so what if we
- 01:12:41would like if we would like to have
- 01:12:44different departments or different
- 01:12:47customers in our organization in our
- 01:12:51network how can we do that in InfiniBand
- 01:12:55we know how to do it in ethernet how do
- 01:12:58we allow different departments to have
- 01:13:02the different kind of environment in
- 01:13:04InfiniBand we have only one subnet can
- 01:13:10I ask if some of you in his environment
- 01:13:12has such a need or thought of such a
- 01:13:17need who knows what the solution is well
- 01:13:22since um there is no answer I will
- 01:13:25answer myself I will ask and answer so
- 01:13:29the answer will be partitioning
- 01:13:34partitioning is the way that you can
- 01:13:37actually implement in your environment
- 01:13:39on your InfiniBand fabric in order to
- 01:13:43have to define different partitions for
- 01:13:47different customers you may have the red
- 01:13:51partition and the green partition and
- 01:13:53the blue partition which all belong to
- 01:13:56the same subnet but their packets have
- 01:14:01different color actually it's called
- 01:14:04different identifiers and that specific
- 01:14:08identifier is called partition
- 01:14:13partition key okay is actually the
- 01:14:18identifier of every different partition
- 01:14:22so now you can say I have the storage
- 01:14:26guys I have the Open MPI guys I have
- 01:14:31other type of guys IP over IB guys and I
- 01:14:36would like to have them on a
- 01:14:39separated environment although all of us
- 01:14:43belong to the same subnet that is called
- 01:14:48partitioning so that partitioning
- 01:14:55can help us to provide different
- 01:14:58environment to different applications or
- 01:15:03different environment because of
- 01:15:05security purposes because for example in
- 01:15:08your Institute you have two researchers
- 01:15:12one of them is for University and the
- 01:15:17other one is for Boeing or Airbus and
- 01:15:22the research for Airbus the Airbus guys
- 01:15:25who are paying money for that they are
- 01:15:27telling you we wouldn't like to have our
- 01:15:30part of the network to have the packets
- 01:15:34touch with the other common people
- 01:15:37packets can you separate them the answer
- 01:15:41is yes partitioning
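
The separation described here can be sketched as a partition key comparison. This is a minimal illustration under stated assumptions, not how any particular switch or HCA implements it: the helper names are hypothetical, the keys `RED` and `GREEN` are example values, and the rule used is the one from the InfiniBand Architecture Specification, where a P_Key is 16 bits, the low 15 bits identify the partition, and the top bit marks full (1) or limited (0) membership.

```python
# Minimal sketch of partition key (P_Key) matching between two ports.
# Helper names and key values are illustrative assumptions.

MEMBERSHIP_BIT = 0x8000  # top bit: 1 = full member, 0 = limited member
KEY_MASK = 0x7FFF        # low 15 bits: the partition identifier


def partition_id(pkey: int) -> int:
    """Return the 15-bit partition identifier of a P_Key."""
    return pkey & KEY_MASK


def is_full_member(pkey: int) -> bool:
    """Return True if the P_Key carries full membership."""
    return bool(pkey & MEMBERSHIP_BIT)


def can_communicate(pkey_a: int, pkey_b: int) -> bool:
    """Two ports may talk only if they share a partition identifier
    and at least one of them is a full member of that partition."""
    if partition_id(pkey_a) != partition_id(pkey_b):
        return False
    return is_full_member(pkey_a) or is_full_member(pkey_b)


# Example partitions (hypothetical values): red = 2, green = 3, full members.
RED = 0x8002
GREEN = 0x8003

if __name__ == "__main__":
    print(can_communicate(RED, RED))        # same partition
    print(can_communicate(RED, GREEN))      # different partitions
    print(can_communicate(0x0002, 0x0002))  # two limited members
```

Packets whose key does not match are dropped at the receiving port, which is what keeps the red and green traffic apart even though both share one subnet.
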
- 01:15:46the packets of the red environment will
- 01:15:49have PKey or partition key ID 2 and
- 01:15:53the green environment will have a
- 01:15:55partition key ID 3 this is not the only thing
- 01:16:01the other thing that you can have with
- 01:16:04partitioning is quality of service which
- 01:16:07could be different could be different
- 01:16:11it's not the only tool but is one of the
- 01:16:14tools that will allow you to provide
- 01:16:16different quality of service to Open MPI
- 01:16:20and other quality of service to GPFS why
- 01:16:26because I might put them I might place
- 01:16:30them on different environment or
- 01:16:32different partition yeah okay let's stop
- 01:16:40there let's stop here we'll have some
- 01:16:43coffee break and we will continue in 15
- 01:16:4715 minutes
- InfiniBand
- Mellanox
- HPC
- RDMA
- Subnet Manager
- training
- Leaf and Spine
- network performance
- data rate