Hi Veeky Forums

I have a question that's been bugging me for a while.
In a gradient descent algorithm, why does having a normalized dataset help it converge faster?
The only explanation I found online is that if the contour plot of the cost function J(θ1,θ2) is stretched and thin (pic related), the algorithm oscillates.
Why does it oscillate? I don't understand why it oscillates.

Other urls found in this thread:

coursera.org/learn/machine-learning/lecture/xx3Da/gradient-descent-in-practice-i-feature-scaling

Because if you don't normalize your data, relatively small differences between data points can look huge, and relatively large differences can look tiny. For example, adult heights measured in centimeters might span a range of about 30, while incomes measured in dollars might span 100,000. The difference in range between those two features is arbitrary (it depends on the units), and it's something you want to remove as a consideration: what you actually care about is the relative difference between your data points, not whatever the absolute difference happens to be.
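To make that concrete, here's a minimal sketch of z-score standardization (one common way to normalize), assuming NumPy; the two feature columns are made up for illustration:

```python
import numpy as np

# hypothetical raw features: height in cm, income in dollars
X = np.array([[170.0,  40_000.0],
              [180.0, 120_000.0],
              [160.0,  55_000.0]])

# z-score standardization: subtract each column's mean, divide by its
# standard deviation, so every feature ends up on a comparable scale
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_norm = (X - mu) / sigma

print(X_norm.mean(axis=0))  # each column now has mean ~0
print(X_norm.std(axis=0))   # and standard deviation ~1
```

After this, a one-unit move along either axis of θ means roughly the same thing, which is why the contours of J become rounder.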

I want to understand it from a mathematical point of view. Why does data with a stretched contour make it oscillate? I can't grasp this.
Sorry if I'm being stupid.

What you're trying to do is get to the center as efficiently as possible. The fact that the contour is stretched means the steps land farther from the center, which takes the algorithm more time to get where you want it to go. It returns those far-off values on non-normalized data because the pattern you're trying to learn is still there, but it isn't weighted properly, so irrelevant accidents of scale push the algorithm into worse choices at first.
You can see the difference in how obvious a trend looks by graphing the data before normalization and comparing it to a graph of the data after normalization. If the data isn't normalized, a huge relative difference between data points can look like it's hardly there at all.

but gradient descent uses the gradient at the current point to decide where to go next.
The fact that the contour is stretched just says there is a big difference in scale between theta 1 and theta 2. I don't see why that would make convergence a problem.

This is how it's explained in a video course on Coursera.
I don't understand why it oscillates like that.

>to somehow decide where to go next
When you say "somehow", do you mean you're not sure how specifically the gradient translates into a change in a node/connection weight? Each component of the gradient tells you how much the error is changing relative to its respective node/connection, which gives you the amount and direction to update that weight so you move in the direction of decreasing error.
Maybe it would help if I referenced this picture of the sigmoid activation function to point out it's only sensitive to inputs within a narrow range, roughly -4 to +4. If you have data that goes past the point where the function flattens out at 0 or 1, then inputs that should be interpreted as different, like 4 vs. 5, get interpreted as nearly the same (both just max out near 1 for output). That makes the algorithm give answers that miss the mark in either direction, with more of that "oscillation" going on, because it doesn't actually have a good basis for knowing which way to go. Eventually it finds its way to the center despite that problem, but it gets there more erratically because the non-normalized data was producing similar activations for different inputs. The normalized example takes a much straighter path because everything stays within the sensitive range, so the gradient at each node/connection informs a good weight update with error-reducing results more of the time.
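A quick sketch of that saturation effect (my own toy numbers, not from the course): inside the sensitive range, different inputs give clearly different outputs, while out in the flat tail they're nearly indistinguishable:

```python
import math

def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + e^-z), flattens out for |z| >> 4."""
    return 1.0 / (1.0 + math.exp(-z))

# inside the sensitive range: a clear difference in output
print(sigmoid(0.0), sigmoid(1.0))   # 0.5 vs ~0.731

# in the saturated tail: 4 and 5 produce almost the same output
print(sigmoid(4.0), sigmoid(5.0))   # ~0.982 vs ~0.993
```

The gradient there is nearly zero too, so the weight updates carry almost no information about which of those inputs was larger.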

I would assume the issue is the step size. A single learning rate has to serve every direction at once: if the descent is steep in one direction but shallow in another, a step size large enough to make progress in the shallow direction will overshoot the minimum along the steep one, setting up an oscillation.
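A minimal sketch of that effect, using a toy elongated-bowl cost J(θ1,θ2) = θ1² + 50·θ2² (my own example, chosen to mimic the stretched contour in the picture):

```python
import numpy as np

# toy elongated bowl: J(t1, t2) = t1**2 + 50 * t2**2
# gradient: (2*t1, 100*t2) -- steep along t2, shallow along t1
def grad(theta):
    return np.array([2.0 * theta[0], 100.0 * theta[1]])

def descend(alpha, steps=100):
    theta = np.array([10.0, 1.0])
    for _ in range(steps):
        theta = theta - alpha * grad(theta)  # standard gradient descent update
    return theta

# a learning rate just under the stability limit for the steep t2 direction:
# converges, though slowly along the shallow t1 direction
print(descend(alpha=0.019))

# nudge it slightly higher and the t2 component overshoots more each step
# and blows up, while t1 alone would still have been fine
print(descend(alpha=0.021))
```

The stability limit comes from the steepest direction (here, α must stay below 2/100), but the convergence speed is set by the shallowest one; normalization makes the two curvatures comparable so one learning rate works well for both.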

I am still new to machine learning; I haven't gotten to node/connection notation yet.
The picture shows how theta 1 and theta 2 are updated.
I still don't see why not normalizing the features would cause oscillation. I mean, when you look at the 3D graph it's a valley, so why doesn't the algorithm just follow the valley the way water does?

As you can see in the figure here,
the red line shows the steps the algorithm takes. I don't understand why it behaves like that.