This principle works with video: you can create one high-quality output from multiple bad-quality sources. But is this feasible with audio? I tried aligning and mixing 10 audio clips, and it was a failure. I thought I aligned them pretty well, but I guess I didn't, or this idea just doesn't work with audio, period. What do you all think?
Align how? In a multitrack editor, with sample accuracy? If not, and if the sources aren't already digital, you are wasting your time (and likely making it worse with phasing/comb-filtering effects).
That's right, but it's hard to get exact sample accuracy. They are intro clips from a TV show and I don't know how they were captured. The results did have phasing and comb-filtering effects, yes. Is it safe to say this is a useless endeavor at this point?
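For what it's worth, a rough way to estimate the offset between two takes is cross-correlation. This is just an illustrative numpy sketch (the function name is mine, not from any tool), assuming both clips are float arrays at the same sample rate:

```python
import numpy as np

def align_offset(ref, clip):
    """Estimate the integer-sample offset of `clip` relative to `ref`
    via full cross-correlation (the peak location gives the lag)."""
    corr = np.correlate(clip, ref, mode="full")
    # lag 0 corresponds to index len(ref) - 1 in the "full" output
    return int(np.argmax(corr)) - (len(ref) - 1)

# toy check: a copy of the reference delayed by 5 samples
rng = np.random.default_rng(0)
ref = rng.standard_normal(1000)
clip = np.concatenate([np.zeros(5), ref])[:1000]
print(align_offset(ref, clip))  # → 5
```

This only finds whole-sample offsets; sub-sample drift (different capture clocks) would still leave comb filtering behind.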
To do multiple-pass averaging you need to switch to the frequency domain and perform spectral convolution. Theoretically this is possible, and it is used in practice in statistical DSP for noise reduction, but I'm not aware of any ready-to-use, audio-specialized application that can do such a thing. You could try it in a DSP/math application (Matlab, or one of its free alternatives, e.g. Scilab or Octave), but the complexity will probably be very high and the efficiency far from theoretical (I assume you'd need multiple overlapping windows, envelope detection, and also time stretching, since if the source is mechanical you will suffer from serious jitter).
As you can see, this is not trivial. If you succeed, I assume you could start a business around it, and a PhD dissertation shouldn't be a problem either.
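A toy illustration of the frequency-domain idea described above: average the STFT magnitudes across takes and reuse one take's phase, so exact sample alignment matters somewhat less. This is a bare-bones numpy sketch under stated assumptions (equal-length, roughly aligned clips); a real implementation would need the overlap, envelope, and jitter handling mentioned in the post:

```python
import numpy as np

def spectral_average(clips, win=1024, hop=512):
    """Average STFT magnitudes across takes, reuse the phase of the
    first clip, and resynthesize with windowed overlap-add.
    Assumes all clips are the same length and roughly aligned."""
    w = np.hanning(win)
    n = len(clips[0])
    n_frames = (n - win) // hop + 1
    out = np.zeros(n)
    norm = np.zeros(n)
    for f in range(n_frames):
        s = f * hop
        frames = [np.fft.rfft(c[s:s + win] * w) for c in clips]
        mag = np.mean([np.abs(fr) for fr in frames], axis=0)
        phase = np.angle(frames[0])            # reference phase
        seg = np.fft.irfft(mag * np.exp(1j * phase), win)
        out[s:s + win] += seg * w              # overlap-add
        norm[s:s + win] += w * w               # window compensation
    return out / np.maximum(norm, 1e-8)
```

Sanity check: feeding the same clip several times should return (the interior of) that clip unchanged, since the averaged magnitude and reference phase reconstruct it exactly.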
Thanks pandy for the info. It's not just about noise reduction but to increase the quality in general because the source I have is 128 kb/s MP3. I want to get rid of possible flanging and other DCT artifacts.
Originally Posted by Cornucopia
What do you mean by correlated? When the problem is in the same spot in all clips?
Does that look aligned? It does to me but there's no real way to know if it's perfect. All 30 clips sound the same but they obviously aren't on the micro level.
So micro level is crucial...
That's the thing, I don't think the problems are correlated. All were encoded as 128 kb/s MP3, but they're aligned differently, so the distortion will always be different, with few correlated parts. The principle works with video, where the only distortion that survived was blocking patterns on flat areas, but all uniform and intermittent noise is gone.
It may look aligned, but every sample (from first to last) must be aligned.
So far I've mixed 12 of the 30 clips (which took hours altogether) and the result has boomy bass and quieter treble. A spectrograph shows this anomaly isn't consistent: in one part, everything above 3 kHz is visibly quieter, and this threshold slowly rises to 8 kHz over time and then comes back down. Actually, it's even more complicated. In the first 5 seconds, the 1800-2500 Hz band is significantly quieter than the rest of the bands.
Is this an obvious symptom of mixing misaligned clips?
Sorry to say this, and I wish there were a simple way to do what you want, but an audio signal is in many ways far more complex than a video signal. Take an instrument playing a fundamental of, say, middle C: on top of that there will be a multitude of related harmonics, and some of those harmonics will be shared by other instruments. Apart from time differences there are phase differences, amplitude differences, and so on, so trying to align all these variables is a huge task, if not impossible with easily obtainable equipment and software. I fear your quest is in the same category as tilting at windmills.

BeyonWiz T3 PVR ~ Popcorn A-500 ~ Samsung ES8000 65" LED TV ~ Windows 7 64bit ~ Yamaha RX-A1070 ~ QnapTS851-4G
I fear we might be miscommunicating. What you describe sounds like the same orchestra performed two separate times and mixed together. Of course that would be fruitless to combine as the pitches and lengths of the notes will be slightly different. This isn't the case here. The intro clips I have are the same but how they were transmitted and captured I have no idea.
Just wanna make sure we're on the same page.
Yes, I understood that. I still think it is a fruitless exercise, except maybe if you have access to a large audio research lab with really high-end equipment. As you ended your post: "...but I guess I didn't or this idea just doesn't work with audio, period..." Absolutely chalk and cheese.
Signal is signal, and audio is not more complex than video (in fact, audio is a 1D signal, whereas TV is 3D: 2D for XY, with time as the third dimension).
Video is easier to average because frames are relatively small (for example, 720x576), but video suffers from the same problems as audio: stacking misaligned frames will lead to distortions too.
The symptoms you describe are clear proof that the audio sources are different; there may be several different problems responsible for this. Losing treble (my example is a special case for treble) and gaining bass are consistent with the theory.
Why is this variable across time? Beat frequencies: https://en.wikipedia.org/wiki/Beat_%28acoustics%29 - two signals slightly different in phase will mix together, giving unwanted modulation as a result.
Spectral averaging may partially reduce this problem, but such distortions will still not be completely removed. In real analog-to-digital conversion, multiple AD converters are used, but the sampling clock is shared, and it must be equalized across all ADCs, sometimes to sub-nanosecond accuracy.
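The beat effect linked above is easy to reproduce: mix two tones a few Hz apart and the combined level swells and dips at the difference frequency. A small illustrative numpy sketch (the frequencies are arbitrary):

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs                  # one second of samples
# 440 Hz plus a copy detuned by 3 Hz
mix = np.sin(2 * np.pi * 440 * t) + np.sin(2 * np.pi * 443 * t)

# RMS level in 50 ms windows rises and falls at the 3 Hz beat rate
win = fs // 20
rms = [np.sqrt(np.mean(mix[i:i + win] ** 2))
       for i in range(0, fs - win, win)]
print(round(max(rms), 2), round(min(rms), 2))  # loud peaks, near-silent dips
```

The same mechanism, applied per frequency band to two slightly drifting captures, gives exactly the slow, frequency-dependent loudness wobble described earlier in the thread.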
Flanging (that psychedelic whooshing sound) originally came from two analog tape recorders trying to play back the same sound.
If you record the same sound source with two microphones placed at slightly different distances, you get a hollow sound.
My point being that adding a sound with itself is extremely sensitive to slight delays. By slight, I mean <1 msec.
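The comb-filter effect of a sub-millisecond delay can be shown numerically: mixing a signal with a 0.5 ms delayed copy of itself notches out 1 kHz and its odd multiples. An illustrative numpy sketch using a unit impulse, so the spectrum reads off the comb response exactly:

```python
import numpy as np

fs = 48000
delay = 24                          # 24 samples = 0.5 ms at 48 kHz
n = 4096

# unit impulse mixed with a 0.5 ms delayed copy of itself;
# the spectrum of the mix is the comb response |1 + e^(-jw*delay)|
x = np.zeros(n)
x[0] = 1.0
mixed = x + np.roll(x, delay)

spec = np.abs(np.fft.rfft(mixed))
freqs = np.fft.rfftfreq(n, 1 / fs)

# first notch lands at 1/(2 * 0.5 ms) = 1 kHz, first peak at 2 kHz
notch = spec[np.argmin(np.abs(freqs - 1000))]
peak = spec[np.argmin(np.abs(freqs - 2000))]
print(round(notch, 3), round(peak, 3))  # → 0.012 2.0
```

Halve the delay and the first notch moves up to 2 kHz, which is why the coloration changes so audibly with tiny alignment errors.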
Perform this test:
- Invert the phase of one source. In Audacity, it's the Invert effect.
- Add (mix) the two sounds together.
- Play with level, delay and speed to get the maximum cancellation - this is easy to hear as the "sweet spot" will be very noticeable.
- When you've done this, the sounds are as perfectly sync'ed as they can be.
- Now remove the phase inversion and listen to the final output.
- If the maximum cancellation was not very deep (say, less than 10 dB below the source level), the output will probably sound bad.
- If the maximum cancellation was pretty good, and what remained was mostly noise or reverberation, there's a chance of a good result.
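The test above can be sketched in code. This is only an illustration with numpy (the function name and the toy white-noise signals are mine, not from the post); note that for broadband material even a one-sample misalignment gives essentially no cancellation:

```python
import numpy as np

def cancellation_db(a, b):
    """Mix `b` phase-inverted into `a` and report the residual level in dB
    relative to `a` (more negative = deeper null = better alignment)."""
    n = min(len(a), len(b))
    residual = a[:n] - b[:n]            # phase inversion + mix = subtraction
    rms = lambda x: np.sqrt(np.mean(x ** 2))
    return 20 * np.log10(rms(residual) / rms(a[:n]) + 1e-12)

rng = np.random.default_rng(2)
x = rng.standard_normal(48000)          # one second of "broadband" material
print(cancellation_db(x, x))            # identical copies: very deep null
print(cancellation_db(x, np.roll(x, 1)))  # one sample off: ~+3 dB, no null
```

The one-sample-off case actually comes out louder than the source, which matches the warning that the sweet spot is narrow and everything around it sounds worse.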
pandy, I agree that audio is less complex than video, but that's precisely what makes it harder: you have a lot less to work with, and it makes everything a lot more annoying. I don't get why mixing these clips would give a spectral phase effect, though; shouldn't it be a chorus effect if misaligned clips were mixed?
raffriff42, if the two sources were perfectly aligned, mixed together, and then normalized, why would it sound worse than the worse of the two? Wouldn't it be somewhere in between the good mix and the bad mix?
Anyway, I've finally mixed all 30 clips, working very hard to make sure everything was aligned properly. I did drop 3 or 4 clips because I couldn't find any focal point to align them by, so the total is about 26 clips.
Screenshots of the spectrographs of the first clip and the final mix:
The final output has a very slow phasing effect (loud bass and quiet treble) but no smearing or chorus effect. I'll see if I can selectively fix the spectral problems and post the final result. Right now this is just for fun because this is way too annoying and tedious to be practical.
How did you "mix" them? I.e., what kind of "average"? In video or image processing, a median average is used, not a mean average; otherwise the noise or spurious signals get mixed in, not out. In general, a median will increase the signal-to-noise ratio, while a mean will reduce it.
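The median-vs-mean point can be demonstrated with a toy numpy example (illustrative only): nine copies of the same signal, each with spikes in different random places. The per-sample median rejects the spikes outright; the mean smears them into the output:

```python
import numpy as np

rng = np.random.default_rng(3)
clean = np.sin(np.linspace(0, 2 * np.pi, 1000))

# 9 copies of the same signal, each with spikes in different random spots
copies = np.tile(clean, (9, 1))
for row in copies:
    row[rng.integers(0, 1000, 10)] += 5.0

med = np.median(copies, axis=0)   # per-sample median across copies
avg = np.mean(copies, axis=0)     # per-sample mean across copies

# median error is essentially zero; mean carries the diluted spikes
print(np.max(np.abs(med - clean)), np.max(np.abs(avg - clean)))
```

A spike survives the median only if it lands in the same sample position in at least 5 of the 9 copies, which is vanishingly unlikely here; the mean always keeps a 1/9 share of every spike.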
The ear is highly sensitive to short delays and small phase changes - that's how it can tell where a sound is coming from.
The eye is very insensitive to fast changes - that's why 24fps movies fool the eye into seeing motion.
This has been known for a long time, since at least 1912.
The two must be treated very differently.
> But would this still happen if I aligned them perfectly?
It might work, yes. See my post #17.
But I thought you said,
> They are intro clips from a TV show and I don't know how they were captured.
You're not likely to get perfect alignment even with two captures from the same broadcast, much less with two DVDs, and forget about VCRs.
Have you tried Audacity's noise reduction as I suggested in post #4? Even with the default settings, it produces amazing results IMHO.
AviSynth's Merge is a weighted mean average, not a median. If you had videos with a random distribution of "noise", the signal-to-noise ratio would usually decrease, not increase.
This is an oversimplification, but a mean average pollutes the signal, because the "bad" signals are incorporated into the average. E.g., if I take 2 identical lossless tracks and 1 silent track and "mix" them, a median would give back the identical lossless track; a mean average would not.
The median plugin by ajk in avisynth works for aligned video
This describes the photography case, but it's analogous for video.
I don't know how to do it for audio
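The two-identical-plus-one-silent example from above, written out as a quick numpy check (a short sine stands in for the "lossless track"):

```python
import numpy as np

x = np.sin(np.linspace(0, 2 * np.pi, 100))   # stand-in "lossless track"
silent = np.zeros(100)
stack = np.stack([x, x, silent])

med = np.median(stack, axis=0)   # element-wise median of the 3 tracks
avg = np.mean(stack, axis=0)     # element-wise mean of the 3 tracks

print(np.array_equal(med, x))    # → True: median returns the track intact
print(np.array_equal(avg, x))    # → False: mean is pulled toward silence
```

With an odd number of tracks the median just picks the middle value per sample, so the two matching tracks always outvote the outlier.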
raffriff42, I was asking in theory since I've never done this until now. No, Audacity's noise remover blows. I have better tools. Noise isn't a huge problem, I just wanted to see if there was benefit from blending audio clips together.
poisondeathray, for video the noise decreases temporally: it isn't in the same spot twice, so it sums toward zero, while the signal tends to stay in the same spot, which is why it's preserved.
You're right that mixing in a silent track wouldn't add anything. But mix in a track whose first millisecond has signal and whose remainder is silent, and it would affect the mix: that one moment of loud signal gets added to the pile while contributing nothing for the rest of the track, so that small part would be visibly louder than the rest of the final mix. So I'm not sure whether this counts as a mean or a median mix.
I'll try the median plugin for video.
Anyway, I've fixed the spectral phasing problems with an EQ, and the output sounds good now; I can't really tell it apart from the individual clips. They sound the same. However, I'm not satisfied with the result. Looking at it closely in spectrographic view, fine details have been eroded. Some tiny glitches here and there were fixed, but that doesn't outweigh the problems the process introduced, nor the hours this experiment took, only to have little to show for it.
Once again: for time-domain averaging, your samples must be aligned perfectly and stay in sync until the last sample. In practice this can only be achieved locally, when the sample clock is shared by the converters.
Averaging assumes two things: your signal is periodic (in your case it is one period), and the distortions are uncorrelated with the signal. That is not the case for lossy compression, where distortions are highly correlated with the signal.
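The uncorrelated-vs-correlated point can be checked numerically: averaging 30 takes shrinks independent noise by roughly a factor of sqrt(30), but an error that is the same function of the signal in every take (as codec artifacts roughly are) survives averaging untouched. A toy numpy sketch, with a made-up signal-dependent distortion:

```python
import numpy as np

rng = np.random.default_rng(4)
clean = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 10000))

# uncorrelated: fresh random noise in each take -> averages away
takes_u = [clean + 0.1 * rng.standard_normal(10000) for _ in range(30)]

# correlated: the same signal-dependent error in every take
# (a crude stand-in for lossy-compression artifacts)
distortion = 0.1 * np.sign(clean) * clean ** 2
takes_c = [clean + distortion for _ in range(30)]

err_u = np.std(np.mean(takes_u, axis=0) - clean)   # ~0.1 / sqrt(30)
err_c = np.std(np.mean(takes_c, axis=0) - clean)   # unchanged by averaging
print(err_u, err_c)
```

This is exactly why stacking many copies of the same 128 kb/s MP3 encode cannot recover what the codec threw away: the error repeats, so it averages to itself.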
poisondeathray, I tried the Median filter and the results were worse than Merge. See my post in the thread.
I assume source was different for those files... or not?
A modern take...