What is voice-to-voice technology?

Voice-to-voice technology allows the transformation of one voice into another, often using AI-based applications.

What applications does voice-to-voice technology have?

It can be used for AI voice changers, voice acting, music production, and any application requiring voice transformation.

How simple is the installation of this voice conversion web interface?

The installation process is user-friendly and works on various operating systems, including Windows, with options for a normal setup or using tools like Anaconda or Google Colab.

What is the typical training time required for this voice conversion technology?

Training can take as little as 30 minutes to create a voice model using the provided web interface.

What are the key steps in using the voice conversion interface?

Key steps include training a voice model, separating vocals from accompaniment, running model inference, and adjusting settings such as pitch extraction and transpose for optimal voice conversion.

What is the best pitch extraction method according to the video?

The Harvest method is recommended for the best quality, though it is slower compared to alternatives like PM and DIO.

Can I convert a song without having separate vocal stems?

Yes, the interface provides tools to separate vocals from background music to facilitate conversion.

What type of audio files are required for training?

Audio files should contain only vocal segments without any music background for effective training.

Is there support available for those working with the interface for the first time?

The interface includes an FAQ section, which provides detailed guidance and addresses common questions.

What are some installation tools suggested for Windows users?

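On Windows, the video suggests installing 7-Zip, unpacking the RVC-beta archive from the Hugging Face page, and launching go-web.bat. A normal install in an Anaconda Python 3.10 environment and a Google Colab notebook are mentioned as alternatives.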

Use AI to Clone ANY Voice & Sing ANY Song for FREE | RVC WebUI Tutorial

00:12:14

https://www.youtube.com/watch?v=-JcvdDErkAU

Summary

TL;DR: This video introduces a web interface for voice-to-voice AI technology, which enables users to transform one voice into another. The process, previously complex, is simplified by this interface, which consolidates the necessary steps: collecting voice samples, training models, and separating vocals. The application can be installed on multiple operating systems, including Windows, using Python, 7-Zip, and Anaconda. The video walks through the interface, including model inference, pitch extraction, and training configuration, with the aim of high-quality voice conversion. It also explains how to prepare audio files so that training samples contain clean vocals, and points to the built-in FAQ tab for additional support. With less than 30 minutes of training time, users can convert both spoken and sung performances, which makes the tool particularly suitable for creative applications such as AI singing voice conversion.

Takeaways

🎤 This voice conversion interface can transform voices using AI technology.
🖥️ The application is compatible with multiple operating systems, including Windows.
🛠️ Installation is simplified through Python, Anaconda, and 7-Zip tools.
⏱️ Training a voice model can take as little as 30 minutes.
🎶 Users can separate vocals from music tracks for conversion purposes.
⚙️ Features a user-friendly interface with tabs for different processing tasks.
🔎 Offers pitch extraction methods, with Harvest providing the best quality.
📁 Organizes audio training samples efficiently for voice model creation.
💾 Provides options for single and batch voice conversions.
📚 Includes an FAQ section for user guidance and troubleshooting.

Timeline

00:00:00 - 00:05:00
The video introduces a new all-in-one app for voice-to-voice conversion, emphasizing its quick training process. Initially, the host explains the complexity of AI singing voice conversion, involving various stages like collecting voice samples, training models, separating vocals from music, and finally mixing them back. The host demonstrates this by using a retrieval-based voice conversion web interface, showcasing how in under 30 minutes, one's voice can be used to sing an example song from Pixabay. They then transition to showing how to set up the app on various operating systems, including Windows, and recommend using certain installation methods like Anaconda for ease of management.
00:05:00 - 00:12:14
The host guides the viewer through setting up their first voice model using the app, detailing steps like setting the experiment name, the sample rate, and the training directory. The app can process long audio files and splits them automatically; the samples only need to be vocals from one person, without background music. The host recommends settings for pitch extraction and training configuration, using a GPU with a large batch size. After training, the model is ready for inference. For songs without separate vocal stems, the separation tab splits vocals from music first, with guidance on directory management and choosing between the HP2 and HP5 models. Finally, the host demonstrates using the pitch setting to match voice characteristics, leading to a successful voice conversion.



Subtitles

00:00:00
hello and welcome to more Nerdy Rodent
00:00:02
geekery voice to voice technology in
00:00:06
case you're not aware of what this is it
00:00:08
basically allows you to change one voice
00:00:10
into another voice a bit like having an
00:00:13
AI voice changer and the best part
00:00:16
everything you need is now all in one
00:00:20
app plus it's really quick to train as
00:00:23
well so here it is the retrieval based
00:00:27
voice conversion web user interface
00:00:29
hardly a mouthful at all you may be
00:00:33
aware that AI singing voice conversion
00:00:35
can be a bit of a task as there are
00:00:38
multiple stages involved before you can
00:00:41
create your masterpiece video with John
00:00:43
Cena dancing while listening to Abraham
00:00:46
Lincoln singing the very latest K-Pop
00:00:48
song first you need to collect a bunch
00:00:50
of voice samples process them train a
00:00:52
model separate the vocals from the music
00:00:55
track you're changing if you don't
00:00:57
already have them separately run your
00:00:59
new AI model on those vocals and finally
00:01:02
mix them back in with the music
00:01:04
thankfully that can now all be done via
00:01:07
this one web interface and what is the
00:01:09
quality like well let's have a listen
00:01:12
I've used an example song from Pixabay
00:01:14
there it is meaning that in less than 30
00:01:17
minutes of training time I can be the
00:01:19
one singing instead so let's take a
00:01:21
quick listen to a clip of the original
00:01:23
so we know what I'm going to convert
00:01:25
from
00:01:26
[Music]
00:01:35
and then now with that voice changed by
00:01:38
this AI to sound like me instead
00:01:41
[Music]
00:01:48
want to do this yourself then stick with
00:01:50
me here and I'll show you exactly how as
00:01:53
with anything Python installation is an
00:01:56
absolute breeze and the best part is
00:01:58
that it works on a range of operating
00:02:01
systems even Microsoft Windows here's a
00:02:05
little table with some of the
00:02:07
requirements
00:02:08
if you use Microsoft Windows sorry if
00:02:11
you are using that I do hope things get
00:02:13
better what you could do is download and
00:02:16
install 7-Zip download the
00:02:19
RVC-beta 7-Zip file from the Hugging
00:02:22
Face page unzip it and then use
00:02:25
go-web.bat
00:02:28
a normal install can also be done just
00:02:31
like they have here though you may want
00:02:33
to download the 7-Zip archive anyway as
00:02:35
that has all the models in it personally
00:02:38
I did the normal install using an
00:02:40
Anaconda virtual Python 3.10 environment
00:02:43
as I like simple app management there is
00:02:47
also a Google Colab available if you
00:02:50
prefer to use Google Colab so with
00:02:53
whatever installation method you chose
00:02:54
you should now have your web interface
00:02:56
up and running let's dive into this
00:02:58
fascinating world of voice to voice
00:03:00
technology and see what amazing things
00:03:03
we can create
00:03:06
if you already have a model you can do
00:03:08
model inference straight away or like me
00:03:11
you can begin with training one if you
00:03:14
don't there is the training tab however
00:03:16
before we delve into the training
00:03:19
process let's just quickly go over these
00:03:21
five tabs so first of all you've got
00:03:23
model inference you've got separation of
00:03:26
accompaniment and vocal train checkpoint
00:03:30
processing so you can mix checkpoints
00:03:32
together there export ONNX which I've
00:03:34
never used and also an FAQ as well to
00:03:38
begin with as mentioned we're going to
00:03:40
start with the training tab as this is
00:03:42
where you will create your very first
00:03:44
voice model step one for the experiment
00:03:48
name simply enter the name you want to
00:03:50
give your project so you could do for
00:03:52
example nerdy because that's me as for
00:03:55
the sample rate I personally prefer
00:03:57
always using 40K and I always leave this
00:04:01
on true as well as that seems to be the
00:04:03
best model architecture you can select
00:04:06
either version 1 or version 2.
00:04:08
personally I prefer version two number
00:04:11
of threads I think is probably picked
00:04:13
automatically
00:04:14
congratulations you have now completed
00:04:17
step one the next step is step two a the
00:04:20
first thing it asks for here is the path
00:04:23
to the training directory if you're not
00:04:26
familiar with terms like files and
00:04:27
directories on your computer this part
00:04:29
can be quite confusing you could think
00:04:33
of directories as computer boxes where
00:04:36
you organize your things files in this
00:04:38
case and I've put them into a training
00:04:41
directory so there is my path training
00:04:44
nerd
00:04:45
if we have a quick look at that
00:04:47
directory as you can see it's absolutely
00:04:50
full of audio files
00:04:52
if your name is different you may wish
00:04:54
to use something else but it's entirely
00:04:57
up to you even though I'd already split
00:04:59
my samples up into around 250 segments
00:05:02
you don't actually need to worry too
00:05:04
much about that because this program
00:05:06
will automatically handle long audio and
00:05:10
split it accordingly generally speaking
00:05:13
between 10 and 50 minutes total audio is
00:05:16
required any vocals are fine singing
00:05:19
talking whatever just make sure that you
00:05:21
don't have any music in the background
00:05:23
it should be all one person vocals only
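The 10 to 50 minute guideline above can be checked before training with a small script. This is a minimal sketch, not part of the RVC web interface: the function names are hypothetical, and it assumes the training directory holds plain PCM `.wav` files, which is the only format the standard-library `wave` module reads.

```python
import wave
from pathlib import Path


def total_wav_minutes(training_dir: str) -> float:
    """Sum the duration of every .wav file in training_dir, in minutes."""
    total_seconds = 0.0
    for wav_path in sorted(Path(training_dir).glob("*.wav")):
        # wave.open only understands uncompressed PCM WAV files.
        with wave.open(str(wav_path), "rb") as wav_file:
            total_seconds += wav_file.getnframes() / wav_file.getframerate()
    return total_seconds / 60.0


def within_suggested_range(minutes: float, low: float = 10.0, high: float = 50.0) -> bool:
    """True when the total falls inside the 10-50 minute range quoted in the video."""
    return low <= minutes <= high
```

Calling `within_suggested_range(total_wav_minutes("training/nerdy"))` would report whether a directory of samples is in that range; the path here is only an illustration.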
00:05:28
okay so now you've put in the directory
00:05:30
with all your samples in you can just
00:05:32
click process data that will take a few
00:05:35
seconds and process all of the samples
00:05:37
for you
00:05:38
now you're ready to move on to step two
00:05:41
B if you have multiple graphics cards
00:05:44
then you can put them in here but I've
00:05:47
only got a single GPU so I just leave
00:05:50
that as is the defaults are absolutely
00:05:51
fine next you have pitch extraction
00:05:54
which has three options personally I
00:05:57
always go with Harvest PM is fast but
00:06:00
low quality DIO is a bit slower but
00:06:03
better quality and Harvest is the
00:06:05
slowest but the best quality so with
00:06:08
Harvest selected there I just click
00:06:10
feature extraction that will take a few
00:06:12
seconds and finish that task
00:06:16
step three well here for the most part
00:06:18
you can just go ahead and click that one
00:06:20
click training button come back in about
00:06:22
10 minutes and you'll have a model
00:06:24
however if you are like me and you do
00:06:27
like to change things a little bit
00:06:28
you've got some options there for how
00:06:30
often you want to save the full model
00:06:32
the total number of epochs the GPU batch
00:06:35
size and some options for saving
00:06:38
personally the way I like to set this up
00:06:40
for a version 2 model is to set that to
00:06:43
10 total training epochs I do to 200
00:06:47
which is about the maximum you'll ever
00:06:49
need as I have a very large GPU I've got
00:06:53
24 gig of VRAM the batch size up to 40
00:06:55
as that's the maximum my GPU will handle
00:06:58
I like to click yes to only save the
00:07:01
latest checkpoint I'll keep cache all on
00:07:04
no and I say yes to save small finished
00:07:07
models
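The settings just described can be written down as a small record to make the arithmetic visible. This is only an illustrative sketch: the field names are hypothetical and are not the web interface's own option names, and it assumes the "10" refers to the save frequency in epochs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrainingSettings:
    """The host's preferred values for a version 2 model (field names are hypothetical)."""
    save_every_epoch: int = 10            # assumed meaning of "set that to 10"
    total_epochs: int = 200               # "about the maximum you'll ever need"
    batch_size: int = 40                  # the limit of the host's 24 GB GPU
    save_only_latest_checkpoint: bool = True
    cache_all_training_sets: bool = False
    save_small_final_model: bool = True


def save_epochs(settings: TrainingSettings) -> List[int]:
    """Epochs at which a save would be triggered under these settings."""
    step = settings.save_every_epoch
    return list(range(step, settings.total_epochs + 1, step))
```

With the defaults above, a save is triggered 20 times over the run, which is one reason to keep only the latest checkpoint. A GPU with less memory would need a smaller `batch_size`.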
00:07:08
so with your model training via that one
00:07:11
click training I would suggest also
00:07:13
going and having a look over at the
00:07:15
frequently asked questions tab there's
00:07:18
quite a lot of information here
00:07:19
particularly useful are question nine
00:07:22
and question 10 how many total epochs
00:07:24
are optimal and how much training set
00:07:27
duration is needed
00:07:29
now that you've got your very first
00:07:31
voice model it's time to do that AI
00:07:34
voice to voice thing if you already have
00:07:36
the voice that you want to convert you
00:07:39
can skip straight to model inference
00:07:41
however if you want to do something like
00:07:43
change the singer of a song that you
00:07:46
don't have the vocal stems for like I
00:07:48
did here then you'll first need to
00:07:50
separate those vocals out from the
00:07:53
background music and this is where the
00:07:56
separation tab comes in handy
00:07:58
once again those files and directories
00:08:01
come into play as you'll need to know
00:08:03
where you've saved your music files the
00:08:06
first box is if you want to convert
00:08:08
multiple files from a given directory as
00:08:11
I tend to do just one at a time I delete
00:08:13
that and then use the box underneath
00:08:15
instead
00:08:17
model selection has two options like it
00:08:20
says at the top there HP2 is for input
00:08:23
without harmony or if with harmony and
00:08:26
extracted vocals do not need harmony
00:08:28
use HP5 basically if you're unsure use
00:08:31
both have a listen to the output and see
00:08:34
which is best for you in my case I'm
00:08:37
going to use HP2 here
00:08:40
by default the output goes into the opt
00:08:43
directory so feel free to change that if
00:08:46
you like
00:08:47
when you're ready push the huge orange
00:08:49
convert button and you'll have split the
00:08:51
vocals from the music
00:08:54
let's have a quick listen to that
00:08:57
of course there's a few seconds of
00:08:58
silence
00:09:03
[Music]
00:09:05
there we go anyway that's done quite
00:09:06
well we've got the vocals there without
00:09:08
the music
00:09:09
even if there is a little bit of an echo
00:09:12
or something there in the voice alright
00:09:14
so now we're ready to go with inference
00:09:17
the page does look huge but really it's
00:09:20
two things in one the top half there is
00:09:22
for single voice conversion and there
00:09:25
you've got a batch as well so I'll just
00:09:27
be going through the one the batch is
00:09:29
essentially the same but you're doing
00:09:30
loads at a time again everything is
00:09:33
pretty straightforward here push that
00:09:35
huge refresh button and then you should
00:09:38
see your options appear in this little
00:09:41
pull down here my list is absolutely
00:09:43
huge as all the girls would agree but
00:09:46
you'll probably only have one option in
00:09:49
there the first time so pick that I'm
00:09:52
going to pick that one because that's my
00:09:53
trained voice
00:09:54
next you have to select a pitch just
00:09:57
like it says above for low to high
00:09:59
conversion use plus 12 if it's about
00:10:01
the same use zero and for high to low
00:10:05
voice conversion use minus 12. the
00:10:08
source voice in this case is quite high
00:10:10
my voice is a bit lower so I'm going to
00:10:13
use minus 12.
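The plus 12, zero, and minus 12 values are presumably semitone offsets, where 12 semitones is one octave and each octave doubles or halves the frequency. That reading is an assumption, as are the helper names below; the sketch only illustrates the arithmetic behind the choice.

```python
import math


def transpose_ratio(semitones: int) -> float:
    """Frequency multiplier for a shift of the given number of semitones."""
    return 2.0 ** (semitones / 12.0)


def suggest_transpose(source_hz: float, target_hz: float) -> int:
    """Pick the closest of -12, 0, +12 for going from the source pitch to the target pitch.

    source_hz and target_hz are rough typical pitches of the two voices.
    """
    interval = 12.0 * math.log2(target_hz / source_hz)
    return min((-12, 0, 12), key=lambda option: abs(option - interval))
```

For a high source voice around 440 Hz and a lower target around 220 Hz, `suggest_transpose(440.0, 220.0)` gives -12, matching the choice made in the video.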
00:10:15
once again those files and directories
00:10:18
come into play here so put the path to
00:10:21
your vocals in if you did that default
00:10:23
voice separation then you'll have the
00:10:25
two files in your opt directory you want
00:10:29
the one which starts vocal so there in
00:10:31
my opt directory I have the long name of
00:10:34
that WAV file the one that starts with
00:10:36
vocal for pitch extraction again PM is
00:10:39
fast and Harvest is best so I like
00:10:42
to select Harvest everything else I
00:10:44
leave at the default apart from this
00:10:46
path to index which should have a pull
00:10:49
down menu there is the one that I want
00:10:51
to use because it matches that inference
00:10:54
voice
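Finding the separated vocal file for the input path, as described a moment ago, can be done by hand or with a few lines of script. This is a minimal sketch with a hypothetical function name; it assumes the output directory is `opt`, that the files are `.wav`, and that the vocal file's name starts with "vocal", as stated in the video.

```python
from pathlib import Path
from typing import List


def find_vocal_files(output_dir: str = "opt") -> List[Path]:
    """Return the .wav files in output_dir whose names start with 'vocal'."""
    return sorted(
        path
        for path in Path(output_dir).glob("*.wav")
        if path.name.lower().startswith("vocal")
    )
```

The full path of a returned file is what goes into the input box on the inference tab.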
00:10:54
okay so now you can go ahead and click
00:10:57
that very tiny convert button and in
00:11:00
just a few seconds you should have your
00:11:02
output
00:11:04
and there it is
00:11:07
[Music]
00:11:14
yeah there we go that's pretty cool
00:11:16
that's pretty cool that's me now you can
00:11:19
right click that save audio as I'm going
00:11:22
to put it in my opt directory as well
00:11:24
I'm using Audacity here I've got the
00:11:26
instrumental so I just drag that other
00:11:29
voice in and then I can File Export as
00:11:32
whatever I want and it will mix those
00:11:34
two voices together
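The video does this mix in Audacity. For readers who want to see what mixing amounts to, here is a rough sketch of the same idea in code: add the two tracks sample by sample and clip to the 16-bit range. The function name and gain parameters are hypothetical, and it assumes both tracks are 16-bit PCM at the same sample rate and channel layout.

```python
import array


def mix_pcm16(vocals: array.array, instrumental: array.array,
              vocal_gain: float = 1.0, instrumental_gain: float = 1.0) -> array.array:
    """Mix two signed 16-bit sample arrays, padding the shorter one with silence."""
    length = max(len(vocals), len(instrumental))
    mixed = array.array("h")
    for i in range(length):
        v = vocals[i] if i < len(vocals) else 0
        m = instrumental[i] if i < len(instrumental) else 0
        total = int(round(v * vocal_gain + m * instrumental_gain))
        # Clip so loud passages do not wrap around the 16-bit range.
        mixed.append(max(-32768, min(32767, total)))
    return mixed
```

Lowering either gain below 1.0 leaves headroom and reduces clipping when both tracks are loud.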
00:11:36
thank you
00:11:40
on your bones
00:11:47
[Music]
00:11:56
plus if you thought that was cool then
00:11:59
you may also like this Nerdy Rodent
00:12:01
video