VideoHelp Forum
  1. Averaging multiple copies works with video clips to create a high-quality output from multiple bad-quality sources, but is this feasible with audio? I tried aligning and mixing 10 audio clips, and it was a failure. I thought I had aligned them pretty well, but I guess I didn't, or this idea just doesn't work with audio, period. What do you all think?
  2. Member Cornucopia
    Align how? In a multitrack editor, with sample accuracy? If not, and if the sources aren't already digital, you are wasting your time (and likely making it worse with phasing/comb-filtering effects).

    Scott
  3. That's right, but it's hard to get it to exact sample accuracy. They are intro clips from a TV show and I don't know how they were captured. The results did have phasing and comb-filtering effects, yes. Is it safe to say this is a useless endeavor at this point?
  4. Depending on the problem, try Audacity's noise reducer and maybe a little equalization.
    There's lots of guides out there:
    https://www.google.com/search?q=salvaging+bad+audio+audacity
  5. Member Cornucopia
    Originally Posted by Aludin View Post
    That's right, but it's hard to get it to exact sample accuracy. They are intro clips from a TV show and I don't know how they were captured. The results did have phasing and comb-filtering effects, yes. Is it safe to say this is a useless endeavor at this point?
    Going about it that way & not controlling the provenance, YES it is.

    Scott
  6. Originally Posted by Aludin View Post
    Averaging multiple copies works with video clips to create a high-quality output from multiple bad-quality sources, but is this feasible with audio? I tried aligning and mixing 10 audio clips, and it was a failure. I thought I had aligned them pretty well, but I guess I didn't, or this idea just doesn't work with audio, period. What do you all think?
    You can do such things in the spatial (time) domain, but only to reduce noise from the preamp and ADC. In theory, doubling the number of ADCs reduces uncorrelated noise by 3 dB. To do this you need correlated (simultaneous) sampling. In practice this is limited in audio to one case: you have a mono recording but stereo acquisition, so you can split the signal in the analog domain and later combine it in the digital domain; the noise floor will be reduced by at most 3 dB (the same can be achieved by oversampling). This approach is sometimes used in the analog path or in RF DSP processing (stacking multichannel ADCs).

    To do multiple-pass averaging you need to switch to the frequency domain and perform spectral averaging. Theoretically this is possible, and it is used in statistical DSP for noise reduction, but I'm not aware of any ready-to-use audio application that can do it. You could try it in a DSP/math application (Matlab or one of its free alternatives, e.g. Scilab/Octave), but the complexity will probably be very high and the efficiency far from theoretical (I assume multiple overlapping windows, envelope detection, and also time stretching, since if the source is mechanical you will suffer from serious jitter).

    http://stackoverflow.com/questions/24609810/matlab-averaging-multiple-ffts-coherent-in...ation#24616591

    As you can see, this is not trivial, and if you succeed I assume you could start a business around it, and a PhD dissertation shouldn't be a problem either.
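
    For what it's worth, the 3 dB-per-doubling figure is easy to check numerically. Here is a rough numpy sketch of the idea (the 440 Hz tone, the noise level and the 48 kHz rate are made-up numbers, and the copies are perfectly aligned by construction):

    Code:
    import numpy as np

    rng = np.random.default_rng(0)
    fs = 48000                                   # assumed sample rate
    t = np.arange(fs) / fs
    clean = np.sin(2 * np.pi * 440 * t)          # stand-in for the "true" signal

    # Ten captures of the same signal, each with its own uncorrelated noise.
    copies = [clean + 0.1 * rng.standard_normal(clean.size) for _ in range(10)]
    avg = np.mean(copies, axis=0)                # sample-accurate mean of all copies

    def residual_db(x):
        # Power of what is left after subtracting the clean signal, in dB.
        return 10 * np.log10(np.mean((x - clean) ** 2))

    print("single copy:", round(residual_db(copies[0]), 1), "dB")
    print("average    :", round(residual_db(avg), 1), "dB")
    # Expect roughly 10*log10(10) ~= 10 dB less residual noise, i.e. ~3 dB per
    # doubling of the number of copies - but only because the noise here is
    # uncorrelated between copies and the alignment is exact.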
  7. Thanks pandy for the info. It's not just about noise reduction but about increasing the quality in general, because the source I have is 128 kbit/s MP3. I want to get rid of possible flanging and other DCT artifacts.

    Originally Posted by Cornucopia
    Going about it that way & not controlling the provenance, YES it is.
    If I could get it aligned perfectly to the sample, would it then work? I did this a few years ago and I'm convinced I aligned 13 clips very well, but the result still sucked, so I never entertained the idea again until now.
  8. Originally Posted by Aludin View Post
    Thanks pandy for the info. It's not just about noise reduction but about increasing the quality in general, because the source I have is 128 kbit/s MP3. I want to get rid of possible flanging and other DCT artifacts.
    You can't reduce correlated noise or distortion with this method - only uncorrelated noise/distortion can be reduced - so if your source is digital, this will not work.

    Originally Posted by Aludin View Post
    Originally Posted by Cornucopia
    Going about it that way & not controlling the provenance, YES it is.
    If I could get it aligned perfectly to the sample, would it then work? I did this a few years ago and I'm convinced I aligned 13 clips very well, but the result still sucked, so I never entertained the idea again until now.
    You must align all the samples... even a small misalignment will ruin the signal.
  9. What do you mean by correlated? When the problem is in the same spot in all clips?

    Originally Posted by pandy View Post
    You must align all the samples... even a small misalignment will ruin the signal.
    Here's the problem:
    https://postimg.org/image/tfs6aqvgf/
    https://postimg.org/image/9q31p4kc1/

    Does that look aligned? It does to me but there's no real way to know if it's perfect. All 30 clips sound the same but they obviously aren't on the micro level.
  10. Originally Posted by Aludin View Post
    What do you mean by correlated? When the problem is in the same spot in all clips?
    Yes, when the problem is correlated with the signal itself (which includes artifacts from signal processing) - multiple averaging can't fix such issues. Averaging is statistical signal processing, and its main assumption is that the useful signal is the same in every copy while the distortions are random, so the distortions average out and the signal does not.

    Originally Posted by Aludin View Post
    Originally Posted by pandy View Post
    You must align all the samples... even a small misalignment will ruin the signal.
    Here's the problem:
    https://postimg.org/image/tfs6aqvgf/
    https://postimg.org/image/9q31p4kc1/

    Does that look aligned? It does to me but there's no real way to know if it's perfect. All 30 clips sound the same but they obviously aren't on the micro level.
    It may look aligned, but every sample (from first to last) must be aligned - even a shift by a single sample can lead to signal distortion. Imagine the samples of the first signal follow this pattern: 1, -1, 1, -1, 1, -1, 1, -1, 1, -1, 1, -1. Now average this with a second signal that has the same pattern but is shifted by a single sample: -1, 1, -1, 1, -1, 1, -1, 1, -1, 1, -1, 1. The end result is a sequence of zeroes... Such a pattern is nothing unusual - it is the representation of a sine (in fact a cosine) at a frequency equal to half the sample rate.
    So the micro level is crucial...
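
    In numbers (a tiny numpy sketch; the 48 kHz rate is just an assumed example):

    Code:
    import numpy as np

    # The pattern above: alternating +1/-1 samples, i.e. a cosine at the Nyquist
    # frequency (half the sample rate).
    a = np.tile([1.0, -1.0], 6)
    b = np.roll(a, 1)                 # the same clip shifted by one sample

    print((a + b) / 2)                # all zeros: that component cancels completely

    # More generally, averaging a signal with a one-sample-delayed copy acts like
    # a comb/low-pass filter with magnitude |cos(pi * f / fs)| - treble goes first.
    fs = 48000                        # assumed sample rate
    for f in (1000, 8000, 16000, 24000):
        print(f, "Hz:", round(abs(np.cos(np.pi * f / fs)), 3))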
  11. That's the thing, I don't think the problems are correlated. All were encoded as 128 kbit/s MP3, but they're aligned differently, so the distortion will always be different, with not many correlated parts. The principle works with video, where the only distortion that survived was blocking patterns in flat areas, while all the uniform and intermittent noise was gone.

    It may look aligned, but every sample (from first to last) must be aligned
    Are you saying that if I align it perfectly at 00:00, there will be a misalignment a few seconds later?
    So far I've mixed 12 of the 30 clips (which took hours altogether) and the result is a boomy bass and quieter treble. A spectrogram shows this anomaly isn't consistent. At one point, everything above 3 kHz is visibly quieter, and this threshold slowly rises to 8 kHz over time and then comes back down. Actually, it's even more complicated: in the first 5 seconds, 1800-2500 Hz is significantly quieter than the rest of the bands.

    Is this an obvious symptom of mixing misaligned clips?
  12. Member netmask56
    Sorry to say this - I wish there were a simple way to do what you want - but an audio signal is in many ways far more complex than a video signal. You have an instrument playing a fundamental of, say, middle C; on top of that there will be a multitude of related harmonics, and some of those harmonics will be shared by other instruments. Apart from time differences there are phase differences, amplitude differences and so on, so trying to align all these variables is a huge task, if not impossible with easily obtainable equipment and software. I fear your quest is in the same category as tilting at windmills.
  13. I fear we might be miscommunicating. What you describe sounds like the same orchestra performing two separate times and the recordings being mixed together. Of course that would be fruitless to combine, as the pitches and lengths of the notes would be slightly different. That isn't the case here. The intro clips I have are the same, but I have no idea how they were transmitted and captured.
    Just wanna make sure we're on the same page.
  14. Member netmask56
    Yes, I understood that. I still think it is a fruitless exercise, except maybe if you have access to a large audio research lab with really high-end equipment. As you ended your post: "...but I guess I didn't, or this idea just doesn't work with audio, period..." Absolutely chalk and cheese.
  15. Member netmask56
    Take a look at these - might trigger a thought? https://www.izotope.com/en/store/deals.html?
  16. A signal is a signal, and audio is not more complex than video (in fact audio is a 1D signal where video is 3D: 2D for the XY plane plus time as the third dimension).
    Video is easier to average because the frames are relatively small (for example 720x576), but video suffers from the same problems as audio - stacking misaligned frames leads to distortions, etc.

    The symptoms you describe are clear proof that the audio sources are different; several different problems may be responsible for this. Losing treble (my example is the extreme treble case) and a boosted low end are consistent with the theory.
    Why it varies over time: beat frequencies https://en.wikipedia.org/wiki/Beat_%28acoustics%29 - two signals slightly different in phase will mix together, producing unwanted modulation as a result.
    Spectral averaging may partially reduce this problem, but such distortions will still not be completely removed. In real analog-to-digital conversion multiple AD converters are used, but the sampling clock is shared and has to be equalized across all the ADCs, sometimes to sub-nanosecond accuracy.
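
    A quick numpy illustration of the beat effect (the 1 kHz tone and the 0.2 Hz offset are invented numbers, standing in for a tiny clock difference between two captures):

    Code:
    import numpy as np

    fs = 48000
    t = np.arange(5 * fs) / fs                    # 5 seconds

    # Two "identical" 1 kHz tones, except the second runs 0.2 Hz fast.
    x1 = np.sin(2 * np.pi * 1000.0 * t)
    x2 = np.sin(2 * np.pi * 1000.2 * t)
    mix = (x1 + x2) / 2

    # Peak level in consecutive 100 ms windows: instead of staying constant,
    # it swells and collapses at the 0.2 Hz beat rate.
    win = int(0.1 * fs)
    env = np.abs(mix[: (len(mix) // win) * win]).reshape(-1, win).max(axis=1)
    print(np.round(env[::5], 2))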
  17. Flanging (that psychedelic whooshing sound) originally came from two analog tape recorders trying to play back the same sound.

    If you record the same sound source with two microphones placed at slightly different distances, you get a hollow sound.

    My point being that adding a sound with itself is extremely sensitive to slight delays. By slight, I mean <1 msec.

    Perform this test (a rough sketch of it in code follows the list):
    1. Invert the phase of one source. In Audacity, that's the Invert effect.
    2. Add (mix) the two sounds together.
    3. Play with level, delay and speed to get the maximum cancellation - this is easy to hear, as the "sweet spot" will be very noticeable.
    4. When you've done this, the sounds are as perfectly synced as they can be.
    5. Now remove the phase inversion and listen to the final output.
    6. If the maximum cancellation was not very deep (louder than -10 dB compared to the source), the output will probably sound bad.
    7. If the maximum cancellation was pretty good, and what remained was mostly noise or reverberation, there's a chance of a good result.
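
    Something along these lines can automate the delay part of the test. A rough sketch, assuming Python with numpy and the soundfile package, with hypothetical file names (the level and speed matching from step 3 is left out):

    Code:
    import numpy as np
    import soundfile as sf        # assumed dependency; any WAV reader would do

    # Hypothetical file names standing in for two captures of the same intro.
    a, fs = sf.read("capture1.wav")
    b, _ = sf.read("capture2.wav")
    a = a.mean(axis=1) if a.ndim > 1 else a      # fold stereo to mono
    b = b.mean(axis=1) if b.ndim > 1 else b

    n = min(len(a), len(b))
    a, b = a[:n], b[:n]

    best_delay, best_null = None, 0.0
    for delay in range(-48, 49):              # search shifts of up to ~1 ms at 48 kHz
        residual = a - np.roll(b, delay)      # steps 1-2: invert one copy and mix
        null_db = 10 * np.log10(np.mean(residual ** 2) / np.mean(a ** 2) + 1e-12)
        if best_delay is None or null_db < best_null:
            best_delay, best_null = delay, null_db

    print("best delay:", best_delay, "samples; null depth:", round(best_null, 1), "dB")
    # A deep null (well below -10 dB) means the clips really line up and averaging
    # may help; a shallow null means summing them will mostly add comb-filtering.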
  18. pandy, I agree that audio is less complex than video, but that is exactly what makes it harder - you have a lot less to work with, which makes everything a lot more annoying. I don't get why mixing these clips would give a spectral phasing effect though; shouldn't it be a chorus effect if misaligned clips were mixed?

    raffriff42, if the two sources were perfectly aligned, mixed together and then normalized, why would it sound worse than the worse of the two? Wouldn't it be somewhere in between the good mix and the bad mix?

    Anyway, I've finally mixed all 30 clips, and I worked very hard to make sure they were aligned properly. I did drop 3 or 4 clips because I couldn't find any focal point to align them, so it's about 26 clips in total.

    Screenshots of the spectrograms of the first clip and the final mix:
    https://postimg.org/image/wwmnnrzm5/
    https://postimg.org/image/y1qlk8ctp/

    The final output has a very slow phasing effect (loud bass and quiet treble) but no smearing or chorus effect. I'll see if I can selectively fix the spectral problems and post the final result. Right now this is just for fun, because it's way too annoying and tedious to be practical.
  19. How did you "mix" them? i.e. what kind of "average"? In video or image processing, a median average is used, not a mean average; otherwise the noise or spurious signals get mixed in, not out. In general, a median rejects the spurious content outright, while a mean blends a share of every copy's defects into the result.
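
    A toy numpy comparison of the two kinds of average, with random loud clicks standing in for spurious signals (all the numbers are invented):

    Code:
    import numpy as np

    rng = np.random.default_rng(1)
    clean = np.sin(2 * np.pi * 3 * np.linspace(0.0, 1.0, 1000))

    # Five aligned copies, each with a handful of loud clicks in different places.
    copies = []
    for _ in range(5):
        c = clean.copy()
        c[rng.integers(0, clean.size, 10)] += rng.choice([-1.0, 1.0], 10)
        copies.append(c)
    copies = np.array(copies)

    for name, est in (("mean  ", copies.mean(axis=0)),
                      ("median", np.median(copies, axis=0))):
        rms_err = np.sqrt(np.mean((est - clean) ** 2))
        print(name, "RMS error:", round(float(rms_err), 4))
    # The median throws the stray clicks away; the mean keeps a fifth of each one.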
  20. The ear is highly sensitive to short delays and small phase changes - that's how it can tell where a sound is coming from.

    The eye is very insensitive to fast changes - that's why 24fps movies fool the eye into seeing motion.

    This has been known for a long time, since at least 1912.

    The two must be treated very differently.
  21. Originally Posted by poisondeathray View Post
    How did you "mix" them? i.e. what kind of "average"? In video or image processing, a median average is used, not a mean average; otherwise the noise or spurious signals get mixed in, not out. In general, a median rejects the spurious content outright, while a mean blends a share of every copy's defects into the result.
    I mixed each one on top of the first one. The wave is a 32-bit float. After all 30 were piled on top, I normalized the waveform so it's the same loudness as any of the individuals. I'm not sure how they mix, probably a median average like you said. For video I use Merge() with RGB32.

    Originally Posted by raffriff42 View Post
    The ear is highly sensitive to short delays and small phase changes - that's how it can tell where a sound is coming from.
    But would this still happen if I aligned them perfectly? I don't see why there would be short delays if I did.
  22. > But would this still happen if I aligned them perfectly?
    It might work, yes. See my post #17.

    But I thought you said,
    > They are intro clips from a TV show and I don't know how they were captured.
    You're not likely to get perfect alignment even with two captures from the same broadcast, much less with two DVDs, and forget about VCRs.

    Have you tried Audacity's noise reduction as I suggested in post #4? Even with the default settings, it produces amazing results IMHO.
  23. Originally Posted by Aludin View Post
    Originally Posted by poisondeathray View Post
    How did you "mix" them? i.e. what kind of "average"? In video or image processing, a median average is used, not a mean average; otherwise the noise or spurious signals get mixed in, not out. In general, a median rejects the spurious content outright, while a mean blends a share of every copy's defects into the result.
    I mixed each one on top of the first one. The wave is a 32-bit float. After all 30 were piled on top, I normalized the waveform so it's the same loudness as any of the individuals. I'm not sure how they mix, probably a median average like you said. For video I use Merge() with RGB32.
    Merge() in avisynth is a weighted mean average, not a median. If the "noise" were randomly distributed spurious content, a mean would blend it into the result rather than reject it.

    This is an oversimplification, but a mean average pollutes the signal, because the "bad" signals are incorporated into the mean average. E.g. if I take 2 identical lossless tracks and 1 silent track and "mix" them, a median gives back the identical lossless track; a mean average does not.

    The median plugin by ajk in avisynth works for aligned video
    https://forum.videohelp.com/threads/362361-Median%28%29-plugin-for-Avisynth

    This describes the photography case, but it's analogous for video.
    e.g.
    http://petapixel.com/2013/05/29/a-look-at-reducing-noise-in-photographs-using-median-blending/

    I don't know how to do it for audio
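
    The two-identical-plus-one-silent case above, written out as a tiny numpy check (the sample values are arbitrary):

    Code:
    import numpy as np

    track = np.array([0.5, -0.3, 0.8, -0.6])   # stand-in for the identical lossless clips
    silent = np.zeros_like(track)

    stack = np.stack([track, track, silent])

    print(np.median(stack, axis=0))   # the original track, untouched
    print(np.mean(stack, axis=0))     # the track scaled by 2/3: the silence leaks in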
  24. raffriff42, I was asking in theory, since I've never done this until now. No, Audacity's noise remover blows; I have better tools. Noise isn't a huge problem anyway - I just wanted to see if there was any benefit to blending audio clips together.

    poisondeathray, for video the noise decreases temporally. It isn't in the same spot twice, so it averages toward zero, while the signal tends to stay in the same spot, which is why it's preserved.

    You're right that mixing in a silent track wouldn't add anything, but mix in a track whose first millisecond has signal and the rest is silence, and it would affect the mix: that one moment of loud signal gets added to the pile while contributing nothing for the rest of the track, so that one small part ends up visibly louder than the rest in the final mix. So I'm not sure if this counts as a mean or a median mix.

    I'll try the median plugin for video.

    Anyway, I've fixed the spectral phasing problems with an EQ, and the output sounds good now - I can't really tell it apart from the individual clips; they sound the same. However, I'm not satisfied with the result. Looking at it closely in spectrogram view, fine details have been eroded. Some tiny glitches here and there were fixed, but that doesn't outweigh the problems it introduced, nor the hours it took to undertake this experiment with nothing to show for it.
  25. Originally Posted by Aludin View Post
    pandy, I agree that audio is less complex than video, but that is exactly what makes it harder - you have a lot less to work with, which makes everything a lot more annoying. I don't get why mixing these clips would give a spectral phasing effect though; shouldn't it be a chorus effect if misaligned clips were mixed?
    I assume the source was different for those files... or not? Lossy compression is also not transparent, and depending on the codec implementation the audio signal may deteriorate in various ways.
    Once again: to do a spatial (time) domain average, your samples must be aligned perfectly and stay in sync until the last sample; in practice this can only be achieved locally, when the sample clock is shared by the converters.
    Averaging assumes two things: the signal is periodic (in your case there is just one period), and the distortions are uncorrelated with the signal. That is not the case for lossy compression, where the distortions are highly correlated with the signal.
  26. poisondeathray, I tried the Median filter and the results were worse than Merge. See my post in the thread.

    I assume the source was different for those files... or not?
    They were all AVIs with MP3 audio. It turns out I was still not aligning them perfectly. I did a trial-and-error attempt now, and at one point the mix came out successful with no phasing problems. So it turns out I was off by one or two samples in most cases, and that's after spending 20 minutes with each clip making sure it looked aligned. So yeah, not worth it.
  27. Member netmask56
    A modern take.....
    [Attached thumbnail: Windmills.jpg]



