00:00:00
So today we're diving into a fascinating study that could change how we think about AI in medicine. Researchers tested one of the newest AI systems, called o1-preview, against both human doctors and previous models like GPT-4, to see how good AI has become at medical diagnosis and decision-making. Now, this wasn't a simple test. The researchers put the AI through four or five different intense challenges, ranging from diagnosing complex medical cases that have stumped doctors, to suggesting treatment plans, to identifying critical conditions that absolutely can't be missed. They used real medical cases from the prestigious New England Journal of Medicine, and these are the kind of complex cases that even experienced doctors find challenging. What makes this study particularly interesting is that they didn't just use multiple-choice questions. Instead, they tested the AI's ability to think and reason like a doctor would in real-world scenarios, because they wanted to see whether the AI could handle the complex, multi-step thinking that doctors use every day when treating patients. In this video I'll break down exactly what they tested, what they found, and what this could mean for the future of healthcare.
00:01:10
Now, one of the things they found in this research was that this AI model, the OpenAI o1 model, was really impressive in comparison to GPT-4. They showcase around three different cases where GPT-4 can't solve a complex case: it can't diagnose it and gets it completely wrong, whereas o1 gets the diagnosis completely right. In case one there was a really complex disease; GPT-4 got it completely wrong, with a Bond score of zero, while o1-preview got it completely right and identified the exact condition. In case two there was another complex case, which GPT-4 completely missed and listed common conditions instead, while o1-preview nailed it and got the rare condition completely right. Then in case three GPT-4 was close, managing a Bond score of three and listing some correct information but incorrect conditions, whereas o1-preview got it exactly right again. What's particularly interesting here is that the Bond score shows how close each AI got: zero is completely wrong, five is exactly right. And these were really tough cases, like medical mysteries. GPT-4 tended to guess more common conditions, but o1-preview was able to identify rare and complex conditions pretty accurately. This basically shows us that with each improvement of AI, and of course with this new series of models, while you might use this AI on a day-to-day basis, it's when we're tackling complex scenarios like this that these thinking models really do shine.
also this image right here and this
00:02:55
image shows a comparison of how well
00:02:57
different diagnostic systems both Ai and
00:03:00
human perform at correctly diagnosing
00:03:02
medical conditions using cases from the
00:03:04
New England Journal of Medicine and this
00:03:06
is from 2012 to 20 so now the types of
00:03:09
systems showns in the blue colors are of
00:03:12
course the modern AI systems and the
00:03:14
light blue is where you have the older
00:03:16
diagnostic systems that required doctors
00:03:18
to manually input symptoms and of course
00:03:20
in the brown bar at the bottom that is
00:03:23
where you can see the human clinicians
00:03:25
performance now overall what we can see
00:03:27
here is that there is of course a Stark
00:03:29
impr Improvement when we look at the 01
00:03:32
preview compared to GPT 4 then when we
00:03:34
look at these older AI systems we can
00:03:36
see that they're not as good and of
00:03:38
course we can see compared to the
00:03:39
clinician there is a large increase in
00:03:42
terms of the percentage correct
00:03:43
diagnosis from here you can see it's
00:03:46
around 30% whereas with these llms it's
00:03:49
around 60 to above 75% which is rather
00:03:52
surprising and this really goes to show
00:03:54
us just how powerful these AI systems
00:03:57
are I know a lot of people give these
00:03:58
generative AI system system Flack
00:04:00
because oh they're just regurgitating
00:04:02
stuff but when you apply them to medical
00:04:05
use cases you can see that these tools
00:04:07
are remarkably powerful for diagnosing
00:04:09
different diseases or diagnosing
00:04:11
different things in a variety of
00:04:12
different scenarios processing complex
00:04:14
bits of medical information and arriving
00:04:16
at correct diagnosis is the kind of
00:04:18
thing that AI is exactly designed for or
00:04:21
should I say uniquely designed for now
00:04:23
Now we can see here Figure 5, a comparison of GPT-4, o1-preview, and physicians for management and diagnostic reasoning. This image shows how well the different groups performed when managing medical cases, called "Grey Matters" management cases, comparing scores between o1-preview by itself, which scores a remarkable 85 to 90%; GPT-4 by itself, scoring around 40 to 50%; human physicians using GPT-4 as a tool, scoring around 40 to 50%; and then of course human physicians using standard, traditional medical resources, scoring a whopping 30 to 40%. So this is rather fascinating. Once again, the scores, ranging from 0 to 100, show us that o1-preview clearly outperformed all the other options by a large margin, and this is fascinating because it performed significantly better than both GPT-4 and the human physicians. Interestingly, there wasn't much difference between GPT-4 alone and the physicians using GPT-4, but this visualization powerfully demonstrates how much more capable o1-preview is at medical management reasoning compared to both earlier AI systems and human physicians, even when those physicians have access to AI or traditional resources.
00:05:38
Now, in addition to this, I do want to caveat this by saying this is o1-preview: this isn't even the full o1, nor is it o3, which was recently demoed by OpenAI, and we know that that model is even smarter. So imagine what kinds of results that would get, if this preview model is getting around 80 to 90%.
00:05:58
We can also see this in terms of the landmark diagnostic cases. These cases are basically the greatest medical mysteries that have been solved; they're famous cases that have become teaching classics in medicine, kind of like the greatest hits of medical diagnosis. These are real patient cases from the past that were particularly challenging or groundbreaking; they helped doctors learn something new about a disease or condition, and they often changed how doctors approach diagnosing similar problems. What makes these landmark cases is that they're usually complex cases that weren't obvious to solve, they often involved unusual combinations of symptoms, the final diagnosis was often surprising or taught doctors something new, and they have become standard teaching tools in medical schools.
00:06:40
Now, when they tested these AI systems on these cases, we can see once again that o1-preview manages to get an extremely high score on the left-hand side, and interestingly GPT-4 alone also manages to outperform physicians using GPT-4, while physicians using GPT-4 do perform better than physicians with standard resources. Interestingly, here we can see that the AI didn't supersede humans by that much, because there were several cases where the humans got these right, but we can see that the AI is definitely really effective when it comes to these landmark diagnostic cases. Whether or not you could say this is a training-data thing, I still think it's remarkably impressive, considering the physicians seem better off with these AI tools than without them.
00:07:23
Now, this graph right here shows how often the different groups caught the most critical diagnoses, what they call "cannot-miss" diagnoses: conditions that, if missed, could be life-threatening for patients. We have four different categories: the residents in pink, who are junior doctors in training; the attending physicians in green, who are experienced, fully qualified doctors; GPT-4 in blue, the previous AI model; and o1-preview in purple, the newest AI model. What the graph shows is a scale that goes from 0 to 1, or 0% to 100%. The boxes show where the majority of the scores were, the black lines show the full range of scores, and the dots show the individual results. All groups performed similarly, at around a 50% to 100% rate, but we can see once again that o1-preview was slightly more consistent, residents showed more variation in performance, and experienced doctors performed about as well as these AI systems. This was rather fascinating, because once again we see that AI manages to perform really well in these scenarios.
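As a rough illustration of how that box-and-whisker summary works, here's a minimal Python sketch that computes the median, the box (interquartile range), and the whiskers (full range) for each group. The per-case detection rates below are invented placeholders, not numbers from the study.

```python
# Illustrative only: hypothetical "cannot-miss" detection rates per group,
# summarised the way a boxplot does (median, box edges, whisker range).
import statistics

detection_rates = {
    "residents":            [0.50, 0.60, 0.75, 0.90, 1.00],
    "attending physicians": [0.70, 0.80, 0.85, 0.90, 1.00],
    "GPT-4":                [0.65, 0.75, 0.85, 0.90, 0.95],
    "o1-preview":           [0.80, 0.85, 0.90, 0.95, 1.00],
}

for group, rates in detection_rates.items():
    q1, median, q3 = statistics.quantiles(rates, n=4)  # box edges and median line
    low, high = min(rates), max(rates)                  # whiskers: full range
    print(f"{group:22s} median={median:.2f}  box=[{q1:.2f}, {q3:.2f}]  range=[{low:.2f}, {high:.2f}]")
```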
00:08:30
Now let me break down this table, which shows how o1-preview planned medical tests compared to what actually happened in the case. If we take a look at this first case, you can see there was a certain plan which the doctors actually followed, and then, interestingly, o1-preview suggested a plan that was very similar to exactly what those doctors suggested. So you can see that in this case it got a score of two, which is a completely correct score when it comes to planning the range of tests you would conduct while trying to figure out what kind of diagnosis you're dealing with. Now, there were some things here that were rather interesting. It was impressive that the AI didn't just suggest random tests: it laid out a comprehensive, step-by-step plan that included backup plans and alternatives, it explained why each test was needed, and it matched what expert doctors actually did in real life. This was rather fascinating, because there are complex steps that go into doing this, and it's important to understand that all of those reasoning steps have to be completed successfully for the AI to get the right answer.
00:09:35
Now, there were certain areas where the AI was wrong: there were two other scenarios, in one of which the AI got half the answer right, while the other it got completely incorrect. But I think the most fascinating thing about this is that this is an AI system which isn't purely medically based; it isn't fine-tuned on medical issues, yet remarkably, when we're looking at these diagnoses and these suggested plans, we're seeing that it's able to sometimes get the right suggested plan and the right steps to take, which is rather impressive. And we can only imagine what's going to happen in the next 5 years, the kinds of models we're going to get and just how accurate they'll be in terms of diagnosing conditions and of course suggesting plans.
00:10:15
Of course, I would say though that I hope humans don't become too reliant on this, because with hallucinations you wouldn't want, you know, a tired, overworked dentist, or a tired, overworked doctor, or a tired clinician or physician, to just use whatever the AI says, and then the next thing you know a hallucination manages to mess up a person. So of course I do think that humans will always have a role to play when it comes to diagnosing individuals.
00:10:37
We could also see here that this individual said: "I had o1 analyze a very specific immune disease for my friend, who happens to be one of the top scientists in the field, and after I shared the results his response was: oh my god, I just read it, this is breathtaking, this is insanely good." So we can also see that the qualitative results from individuals at the top of their field using this do seem to prove that these models are rather fascinating.
00:11:01
So with that being said, what do you guys think is the future of AI and humans when it comes to the medical industry? I think it's really fascinating that we're now starting to explore this in further detail. I do think that with rules and regulations it's going to be pretty hard to actually get these models out into real practice, but I do think we're going to start to see more and more cases where doctors may have missed certain things, and users take it into their own hands to consult a model like o1 or even o3 and get remarkable results that doctors simply would have missed. This is something that I've discussed before: a staggering number of Americans die each year because doctors make mistakes. We will make mistakes, we're humans, but the problem is that in the medical industry sometimes there are situations that are simply life or death, and those mistakes do cost lives. So maybe, by having an AI system review every single decision made, we could catch those rare conditions or diseases that we otherwise would have missed, and then of course have humans check over and run the necessary tests to ensure that what the AI suggested is actually factual.
00:12:00
With that being said, would you be open to having an AI doctor? I personally think that within the next 15 to 20 years we're certainly going to have maybe some pods or something where you prick your finger, you get an instant blood test, you get an AI doctor that tells you everything wrong in your body, you get an instant diagnosis, you get an AI that reasons over all of your personal data. Maybe it knows everything you've done, everything you've seen, everything you've eaten, and it's able to put together probably the most effective plan for you, because it understands your emotional state, your physical state, your water levels, how much you've been drinking, and it can probably suggest the most accurate thing. Context is of course key, and I find that the more context you give these models, and of course your doctors, the better they become. And if we look at how AI is going to be integrated into our lives, I wouldn't be surprised if we're going to be sharing that AI data with our doctors very soon. A very interesting world for those of you who are trying to live forever. With that being said, if you enjoyed this video, I would like to see you in the next one.