Hi Veeky Forums

I've got a question that's been bugging me for a while.
In a gradient descent algorithm, why does having a normalized dataset help it converge faster?
The only explanation I found online is that if the contour plot of the cost function J(θ1,θ2) is stretched and thin (pic related), the algorithm oscillates.
Why does it oscillate? I can't see the reason for the oscillation.


Because if you don't normalize your data, then relatively small differences between data points might look huge, or relatively large differences might look tiny. For example, the difference in adult height measured in centimeters might stay within a range of about 30, while the difference between incomes might span 100,000 dollars. The difference in scale between those two cases is pretty arbitrary, and it's something you want to get rid of as a consideration, because what you really care about is the relative difference between your data points rather than whatever the absolute difference happens to be.
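If it helps, here's what that rescaling looks like in code. This is a minimal sketch with made-up height/income numbers, using plain mean/std standardization (one common choice of normalization; the course may use a different variant like min-max scaling):

```python
import numpy as np

# Hypothetical toy data: column 0 = height in cm, column 1 = income in dollars.
# The numbers are invented for illustration.
X = np.array([[170.0, 40000.0],
              [185.0, 95000.0],
              [160.0, 52000.0],
              [178.0, 61000.0]])

# Standardize each feature: subtract the column mean, divide by the column std.
# Afterwards a "1 unit" difference means the same thing for both features,
# so the arbitrary cm-vs-dollars scale gap is gone.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
```

After this, both columns have mean 0 and standard deviation 1, so neither feature dominates just because its raw units happen to be bigger.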

I want to understand it from a mathematical point of view: why does data with a stretched contour make it oscillate? I can't grasp this.
Sorry if I'm being stupid.

What you're trying to do is get to the center as efficiently as possible. The fact that it's stretched means you're getting results farther off from the center, and that takes the algorithm more time to get where you want it to go. It returns these far-off values on non-normalized data because the pattern you're trying to get it to learn is still there, but it's not weighted properly, which lets irrelevant accidents of scale push the algorithm into worse choices at first.
You can see the difference in how obvious a trend looks by graphing the data before normalization and comparing it to a graph of the data after normalization. If the data's not normalized it might seem like a huge relative difference in data points is hardly there at all.

But gradient descent uses the gradient at the current point to decide where to go next.
The fact that it's stretched just says there's a big difference in scale between theta 1 and theta 2. I don't see why that would make converging a problem.

This is how it's explained in a video course on Coursera.
I don't understand why it oscillates like that.

>to somehow decide where to go next
When you say "somehow", do you mean you aren't sure how specifically the gradient would translate into a node/connection weight change? Each gradient tells you how much the error is changing relative to its respective node/connection, which gives you the best amount and direction to update the weight to move in the direction of decreasing error.
Maybe it would help if I referenced this picture of the sigmoid activation function to point out that it's only sensitive to inputs within a narrow range, around -4 to +4. If you have data that goes beyond that range, where the function flattens out at 0 or 1, then inputs that should be interpreted as different, like 4 vs. 5, will be interpreted as the same (since they'll both just max out at an output of about 1). This makes the algorithm give answers that miss the mark in either direction, with more of that "oscillation" going on, because it doesn't actually have a good basis for knowing which way to go. Eventually it finds its way to the center despite that problem, but to get there it has to behave more erratically, because the non-normalized data was activating similar outputs for different inputs. The normalized example gets there in a much straighter path because everything's within the sensitive range, and the gradient at each node/connection informs a good weight-update decision with error-reducing results more of the time.
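Here's a quick numeric check of that saturation point, assuming the standard logistic sigmoid 1/(1+e^-z) (this is my own illustration, not from the course):

```python
import math

def sigmoid(z):
    """Standard logistic function: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Inside the sensitive range, distinct inputs give clearly distinct outputs:
low_pair = (sigmoid(0.4), sigmoid(0.5))    # roughly 0.60 vs 0.62, a visible gap

# Out on the flat tail, 4 and 5 both just saturate near 1:
high_pair = (sigmoid(4.0), sigmoid(5.0))   # roughly 0.982 vs 0.993, nearly identical
```

So a 4 and a 5 that should behave differently get squashed into almost the same activation, which is the "same output for different inputs" problem described above.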

I would assume the issue is with the step size. A steep descent in one direction but a shallow one in another might suggest a step size that overshoots the local minimum, setting up an oscillation.

I'm still new to machine learning; I haven't reached the node/connection notation yet.
The picture shows how theta 1 and theta 2 are updated.
I don't see why not normalizing the features would cause oscillation. I mean, when you look at the 3D graph it's a valley, so why doesn't the algorithm just follow the valley like water does?

As you can see in the figure here,
the red line shows how the algorithm steps. I don't understand why it behaves like that.

Imagine two of your input values are significantly different from each other and should produce significantly different outputs, but you don't normalize them, so they produce the same output instead. E.g. if one input is 4 and the other is 5, you can look at the corresponding Y-axis values for the X-axis values 4 and 5 on the activation function graph here:
And you'll see both return about the same output of 1.
Gradients in these cases will say to update the weights in similar ways, when really what you need is for the weights to be updated differently depending on whether it's a 4 input or a 5 input. With normalization, that problem is solved, because everything's compacted into the sigmoid's sensitive range and the gradients will actually inform different weight updates for each, moving the path toward reduced error in a straightforward way. Without it, the weight updates won't trace a straight path of shrinking error; they'll meander uncertainly, because the activation function recognizes less of a difference (or none at all) between distinct inputs.

I just recently started looking into this, but: your cost function is your error, which you're trying to minimize (take the gradient, reverse it). Obviously this is going to be an iterative process. To keep things simple, we're basically taking a slope, and that tells us how much to change our next guess (higher slope = bigger change, lower slope = smaller change). So with a high slope caused by not having normalized your dataset, you're going to overshoot your next guess, and each successive guess will need to "oscillate" around the minimum for longer before finding it. Think of dropping a marble in a steep bowl versus a shallow bowl, and which will "find" the bottom first. This is my half-educated guess at least; I haven't looked into this much.
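To make the overshoot concrete, here's a minimal sketch (my own toy example, not from the course) of plain fixed-step gradient descent on a made-up quadratic bowl J(θ0, θ1) = s0·θ0² + s1·θ1². With s1 = 50 it stands in for the stretched contours in OP's pic:

```python
import numpy as np

def descend(scales, lr, steps=60):
    """Fixed-step gradient descent on the toy cost
    J(theta) = scales[0]*theta0**2 + scales[1]*theta1**2."""
    theta = np.array([1.0, 1.0])
    path = [theta.copy()]
    for _ in range(steps):
        grad = 2.0 * scales * theta      # gradient of the quadratic bowl
        theta = theta - lr * grad
        path.append(theta.copy())
    return np.array(path)

# Stretched bowl (un-normalized features): very steep along theta1, shallow along theta0.
# The step that is fine for the shallow direction is too big for the steep one,
# so theta1 overshoots past zero and flips sign every iteration.
stretched = descend(np.array([1.0, 50.0]), lr=0.019)

# Round bowl (normalized features): same curvature in every direction,
# so the same learning rate walks smoothly toward the minimum.
round_bowl = descend(np.array([1.0, 1.0]), lr=0.019)
```

In the stretched case, `theta1` gets multiplied by (1 - 2·50·0.019) = -0.9 every step: it keeps jumping to the other side of the valley (the zig-zag red line), even though it does eventually shrink to zero. In the round case the same learning rate just shrinks both coordinates monotonically, which is the "straighter path" everyone is describing.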

I don't know what an activation function is.
I'm sorry I can't follow your explanation.
I don't think the problem is overshooting, because as you can see from the red line in this figure,
not normalizing the data wouldn't necessarily cause a higher slope.

coursera.org/learn/machine-learning/lecture/xx3Da/gradient-descent-in-practice-i-feature-scaling
This is the video I'm stuck on.

> another ml-let
That's why you have to learn analysis first.

It is overshooting, except in this case it's the direction of each step that overshoots: the step flies past the valley floor along the steep axis, which is what bends the path.

Easy. Can't reach such a "state".

>not normalizing the data wouldn't necessarily cause a higher slope
idk about "necessarily" in an "always significant" sense, but basically yes: it ensures you're not doing more work than you absolutely have to. On the other side of that coin, I think it's possible to have a normalized but "bumpy" dataset, but that's also why you can adjust the sensitivity of your guesses to the specifics of what you're trying to teach the network.

Imagine the function restricted to the line of steepest descent; the step overshoots the local minimum along that line.

Gradient descent does not follow a curved path; it's a segmented linear path. The direction is chosen based on the gradient, then (in the exact line-search variant) the function is minimized along the line in that direction.
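A minimal sketch of that "minimize along the line" variant, on a made-up stretched quadratic (the helper name and the closed-form step size for quadratics are mine, not from the course). A known property of exact line search is that each new gradient is perpendicular to the previous one, which is exactly the right-angle zig-zag in OP's pic:

```python
import numpy as np

def exact_line_search_step(A, theta):
    """One steepest-descent step on J(theta) = theta @ A @ theta,
    minimizing J exactly along the negative gradient direction.
    For this quadratic, the best step size has a closed form."""
    g = 2.0 * A @ theta                    # gradient of J at theta
    alpha = (g @ g) / (2.0 * g @ A @ g)    # argmin over a of J(theta - a*g)
    return theta - alpha * g

A = np.diag([1.0, 50.0])                   # stretched ("non-round") bowl
theta0 = np.array([1.0, 1.0])

prev_grad = 2.0 * A @ theta0
theta1 = exact_line_search_step(A, theta0)
next_grad = 2.0 * A @ theta1
```

Here `prev_grad @ next_grad` comes out (numerically) zero: every segment turns 90 degrees from the last, so on a stretched bowl the path has to zig-zag across the valley instead of running straight down it.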

For "round" data the gradient always points at the global minimum, but as in OP's pic, that's not true for non-round data.
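That's easy to check numerically. A toy sketch (my own example, with J = θ0² + θ1² as the "round" cost and J = θ0² + 50·θ1² as the stretched one) comparing the downhill direction with the direction straight to the minimum at the origin:

```python
import numpy as np

theta = np.array([1.0, 1.0])

# Round cost J = theta0**2 + theta1**2: gradient is 2*theta.
g_round = 2.0 * theta
# Stretched cost J = theta0**2 + 50*theta1**2: gradient is (2*theta0, 100*theta1),
# dominated by the steep axis.
g_stretched = np.array([2.0 * theta[0], 100.0 * theta[1]])

def cosine(a, b):
    """Cosine of the angle between vectors a and b (1.0 = same direction)."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

to_minimum = -theta                                  # straight line to the origin
round_align = cosine(-g_round, to_minimum)           # exactly 1: points at the minimum
stretched_align = cosine(-g_stretched, to_minimum)   # noticeably below 1: points off-axis
```

So on the round bowl, one straight run downhill lands at the minimum, while on the stretched bowl every step aims somewhat across the valley, and the path has to keep correcting.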

Thank you everyone. I think I more or less understand now. Hopefully things will get clearer as I advance in the course.