Wat do

wat do

please be gentle

x$AGE goes into every field

What are bullshit points?

It's data from a sample of 39; the y-axis is hours worked and the x-axis is age. I'm trying to fit a linear regression model, but these two blue points (highlighted like shit) are messing up my line. The red line is far better, but do I have any right to simply remove those blue points?

oldfags and newfags, obviously

They're outliers. Note that you removed outliers in your report, use the good fit, and move on with your life.

This. They're outliers of the independent variable, too, so just say that you reduced your sample space to subjects between thirty years and sixty years or whatever. These data wouldn't be sufficient for extrapolating to anything outside of that range in the first place.

Meant to say thirty-five through forty-five for the reduced sample space.

Why did you not get any people aged 25-35 or 45-55?

On purpose. Professor modified the data to rustle our jimmies thoroughly
>tfw being tested on things you haven't been taught

The linear regression is crap either way. It's clearly random what hours the middle-aged work.

Use Chauvenet's criterion, faggot
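
A minimal sketch of Chauvenet's criterion in plain Python, for anyone who wants to try it. It assumes the values are roughly normal and flags a point when the expected number of points that far from the mean, in a sample of that size, is below 0.5. The `hours` list is made up for illustration and is not OP's data; in a regression setting you would probably feed it the residuals rather than the raw values.

```python
import math
import statistics

def chauvenet_flags(values):
    """Return one boolean per value: True where Chauvenet's criterion flags it."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    flags = []
    for v in values:
        z = abs(v - mean) / sd
        # Two-sided probability of a deviation at least this large under a normal model
        p = math.erfc(z / math.sqrt(2))
        # Flag when fewer than half a point this extreme is expected among n draws
        flags.append(n * p < 0.5)
    return flags

# Made-up numbers, NOT OP's data
hours = [38, 40, 41, 39, 42, 40, 37, 41, 39, 40, 5]
flags = chauvenet_flags(hours)
print([h for h, flagged in zip(hours, flags) if flagged])
```

Note that the criterion only says a point is unlikely under the normal assumption; it does not say the point was a measurement error.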

Check each point's influence on the model using Cook's distance, and remove the points whose Cook's distance is above the cutoff. The usual rule of thumb for the cutoff is 4/N, where N is the number of observations. The number of fitted parameters p seems to be two here: one slope and one intercept.
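
A rough sketch of that check for a one-predictor fit, in plain Python. The ages and hours below are made up and are not OP's data, and the 4/N cutoff is just one common rule of thumb (an assumption on my part; other cutoffs are in use). For a single predictor the leverage has the closed form used in the loop.

```python
def cooks_distances(x, y):
    """Cook's distance for each point of a one-predictor least-squares fit."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in residuals) / (n - p)  # residual variance estimate
    distances = []
    for xi, e in zip(x, residuals):
        h = 1 / n + (xi - x_bar) ** 2 / sxx  # leverage of this point
        distances.append((e * e) / (p * s2) * h / (1 - h) ** 2)
    return distances

# Made-up ages and hours, NOT OP's data
age = [38, 39, 40, 41, 42, 43, 44, 45, 65]
hours = [40, 42, 39, 41, 40, 43, 41, 42, 10]

d = cooks_distances(age, hours)
cutoff = 4 / len(age)  # rule-of-thumb cutoff (assumption; other cutoffs exist)
flagged = [i for i, di in enumerate(d) if di > cutoff]
print(flagged)  # indices of the points that exceed the cutoff
```

A point far out on the x-axis gets high leverage, so even a modest residual can give it a large Cook's distance.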

Thank you

I also heard about some rule that you are allowed to remove at most 5% of your data if the data is anomalous

Good teacher.
What do you think the distribution might have looked like with a better sample?

Even without the bullshit it looks wrong.

Completely random desu. It's not linear, but either way the points are bullshit
Yeah, but you got to prove it

Blue line is a better fit, you shouldn't just remove points that weren't the result of an error

>Blue line is a better fit
literally opposite of definition

Why would you even do a regression on this?

to prove it's not linear.

What's not linear?

What if it is linear?

People could be working slightly longer hours as they get older with maximum variability around 40.

Ransac
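
RANSAC, roughly: repeatedly fit a line through two randomly picked points, count how many points land within some tolerance of it, keep the line with the most support, then refit on those points. A bare-bones sketch in plain Python. The points are synthetic (an exact line plus two planted outliers, not OP's data), and `tol`, `n_iter`, and `seed` are my own illustrative parameters.

```python
import random

def fit_line(points):
    """Ordinary least-squares line through a list of (x, y) points."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope  # (intercept, slope)

def ransac_line(points, tol=1.0, n_iter=200, seed=0):
    """Return (intercept, slope, inliers) of the best-supported line."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(n_iter):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # vertical candidate line, skip it
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        inliers = [(x, y) for x, y in points
                   if abs(y - (intercept + slope * x)) <= tol]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Refit by least squares on the consensus set only
    intercept, slope = fit_line(best_inliers)
    return intercept, slope, best_inliers

# Synthetic points on y = 2x + 1 plus two planted outliers
points = [(x, 2 * x + 1) for x in range(10)] + [(2, 30), (7, -10)]
intercept, slope, inliers = ransac_line(points)
print(round(intercept, 3), round(slope, 3), len(inliers))
```

It automates the "ignore the points that don't agree with the majority" decision, so it still needs the same justification as removing them by hand.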

The model
Well I went on to prove it's linear, got my assumptions violated
This is a problem for GLM

Don't you have to assume it's linear before you do a linear regression?

That's what i just said you doughnut

so you assume it is linear just to prove that it isn't?

Yup

But if you proved it's linear and your assumptions were violated, doesn't that mean your assumption was that it was non-linear?

Your data is garbage, dude. If you leave those points in it's garbage, if you take them out it's garbage.

What that graph says is that Age is a terrible predictor for Hours.

I did not prove it's linear. I assumed the model is correct, but the assumptions OF the model were violated. Thus not linear.

What about their correlation? It increases substantially if we remove those data points. I agree the model sucks, but there is obviously a relation there.

>but there is obviously a relation there.

If every data point was the same age, let's say 40, what do you think the line of best fit would look like?

But they go from 38 to 45. That's 7 years
I get your point, but still, what's with the correlation? It's -0.5

There are no rules for removing data, and 5% seems excessive to me. It's mostly an eyeball call anyway.

A linear regression will always produce a line of best fit, and a correlation statistic will always be found. That doesn't mean either of those is meaningful. Remove those data points and show us the residuals and line of best fit with just the central cluster and we'll see how it looks.
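
Something like this would produce the fit, the correlation, and the residuals for the central cluster. It is only a sketch in plain Python: the ages and hours are made up, not OP's data, and `fit_with_residuals` is a name I picked for illustration.

```python
import math

def fit_with_residuals(x, y):
    """Least-squares line, Pearson correlation, and residuals for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / math.sqrt(sxx * syy)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return intercept, slope, r, residuals

# Made-up central cluster, NOT OP's data
age = [38, 39, 40, 41, 42, 43, 44, 45]
hours = [45, 38, 47, 36, 44, 35, 41, 37]

intercept, slope, r, residuals = fit_with_residuals(age, hours)
print(f"slope={slope:.2f} intercept={intercept:.2f} r={r:.2f}")
for a, e in zip(age, residuals):
    print(a, round(e, 1))  # look for structure or a funnel shape in these
```

If the residuals are as large as the spread of the hours themselves, the line is not telling you much, whatever the correlation says.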

>approximating random points with arbitrarily chosen line that you think looks prettiest
is statistics really considered maths?

>Arbitrarily chosen
>What is least squares
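
For reference, least squares picks the intercept and slope that minimize the sum of squared residuals, and for one predictor it has a closed form. The symbols are my own notation, not from the thread: the data are the pairs (x_i, y_i), a is the intercept, b is the slope, and bars denote sample means.

[eqn]\min_{a,b}\sum_{i=1}^{n}(y_i-a-bx_i)^2,\qquad \hat b=\frac{\sum_{i=1}^{n}(x_i-\bar x)(y_i-\bar y)}{\sum_{i=1}^{n}(x_i-\bar x)^2},\qquad \hat a=\bar y-\hat b\,\bar x[/eqn]

So the line is not arbitrary, but if every x_i were the same the denominator would be zero and the slope would be undefined.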

The data makes no sense. Why would a 39 yo work significantly shorter hours than a 42 yo?

[eqn]Y=\lambda f.\ (\lambda x.\ f(xx))(\lambda x.\ f(xx))[/eqn]

i think that was an attempt to reference 1/20 ______ (data points, what have you..) will be statistically significant


>>>thr0wiing out datA is bad scINCe

if the linear line does not fit that's fine...an f-value and a p-value should explain why that is

They won't explain that at all, and no, it's not fine if the line doesn't fit well. That means you shouldn't settle on that model.

he mentioned he has to do a linear regression on this data

>this obviously isn't being used as a real analysis

underrated

Without those outliers the data is pretty much meaningless.
>"Wow there's a fairly even distribution of hours worked in the age group 35-45!"
And that red line is utter bullshit, implying that the data suggests that the amount of hours the average man works drops to 0 at just past the age of 45, even when you have a fat cloud of outliers that clearly disproves that. My prediction is that the professor will tear you a new one if you drop the so-called outliers. Especially since it's hard to tell but it looks like almost half your points are in those two clouds of outliers.