Help me out Veeky Forums. Does this look like a negative binomial distribution, Poisson distribution, or some other distribution? Why?


Kinda looks like Boltzmann doesn't it?

Why don't you do some parameter fitting and see how well you can make each distribution match?

Besides, on a philosophical level the question is kind of pointless. There are always going to be deviations between empirical data and a hypothetical distribution, even if you've chosen the correct model. The real question is: how much deviance are you willing to tolerate? If you choose negative binomial, are you willing to tolerate the error in your analysis?
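To make "how much deviance are you willing to tolerate" concrete, here's a minimal sketch of quantifying the misfit with a chi-square statistic. The samples are simulated stand-ins (the thread's actual data isn't shown), and the Poisson choice is just for illustration:

```python
# Sketch: quantify how much a fitted Poisson deviates from the data.
# `samples` is simulated here as a stand-in for the real dataset.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
samples = rng.poisson(12, size=10_000)           # stand-in for the real data

mu = samples.mean()                              # MLE for the Poisson rate
ks = np.arange(samples.max() + 1)
expected = stats.poisson.pmf(ks, mu) * len(samples)
observed = np.bincount(samples, minlength=len(ks))

# Chi-square statistic over bins with enough expected counts
mask = expected >= 5
chi2 = ((observed[mask] - expected[mask]) ** 2 / expected[mask]).sum()
print(chi2)
```

You'd then compare `chi2` against a chi-square reference distribution (degrees of freedom = number of bins minus fitted parameters minus one) to decide whether the deviation is tolerable.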

It's discrete tho.

...discrete approximation of an offset Gamma?

I think it can't be a Poisson or negative binomial distribution because it's 0 at 0.
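Quick sanity check of that claim: both families put strictly positive mass at 0 for any parameters (Poisson gives e^(-λ), negative binomial gives p^n), so an empirical pmf that is exactly 0 at 0 rules them out unless you shift the support:

```python
# Both distributions always have positive probability at k = 0,
# so data with zero mass at 0 can't come from either (unshifted).
from scipy.stats import nbinom, poisson

print(poisson.pmf(0, 12))        # e^(-12), small but positive
print(nbinom.pmf(0, 10, 0.3))    # 0.3^10, small but positive
```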

I'm just trying to find a plausible theoretical distribution that is generating this empirical distribution.

That's too broad. You should think about what kind of process is producing your data and try to match that to a distribution, otherwise you'll probably just overfit.

What are you recording?

It's the stopping time (en.wikipedia.org/wiki/Stopping_time) for a stochastic process.

Then why would it be a discrete distribution? Are you parameterizing a stochastic process by a natural number and taking the mean of a bunch of stopping times?

Oh nevermind it's probably a discrete time process.

It's a discrete-time stochastic process.

Some attempts at fitting.

Forgot to mention, I used scipy.optimize.curve_fit.
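For anyone following along, a curve_fit call on an empirical pmf might look like the sketch below. The (k, p_k) points are synthetic here — generated from a known negative binomial so there's a ground truth — since the thread's actual pmf isn't shown:

```python
# Sketch: least-squares fit of a negative-binomial pmf to (k, p_k) points
# with scipy.optimize.curve_fit. p_emp is a synthetic stand-in pmf.
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import nbinom

ks = np.arange(100)
p_emp = nbinom.pmf(ks, 10, 0.3)                  # stand-in empirical pmf

def nb_pmf(k, n, p):
    # curve_fit passes k as floats; round back to integers for the pmf
    return nbinom.pmf(np.round(k).astype(int), n, p)

(n_hat, p_hat), _ = curve_fit(nb_pmf, ks, p_emp, p0=[5, 0.5],
                              bounds=([0.1, 0.01], [200, 0.999]))
print(n_hat, p_hat)
```

Note this minimizes squared error on the pmf values, which is not the same thing as maximum likelihood on the underlying samples — fine for eyeballing fits, less so for formal inference.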

If you have Mathematica there are functions like
reference.wolfram.com/language/ref/FindDistribution.html
reference.wolfram.com/language/ref/FindDistributionParameters.html

If not detail the process and I can try running them for you.

I have Mathematica. Is there a version of these functions that takes in an empirical pmf/pdf (in this case, list of probabilities for each time value) rather than the data itself? I'm using ~1 million data points or more.

>How much deviance are you willing to tolerate
Furries are where I draw the line, personally.

It is a lognormal distribution. Mathematica has an excellent fit function for it.

Not that I know of. A million isn't that many, I'd just bite the bullet or maybe take a random sample and try the functions on that.
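If you do want to fit from the binned pmf directly rather than the raw million samples, exact MLE still works: weight the log-pmf of each value by its count. A sketch with a geometric stand-in (the real distribution is whatever generated the data):

```python
# Sketch: exact MLE from (value, count) pairs instead of raw samples,
# by weighting the log-pmf with the counts. Geometric stand-in data.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import geom

values = np.arange(1, 50)
counts = (1_000_000 * geom.pmf(values, 0.2)).astype(int)   # stand-in histogram

def nll(p):
    # negative log-likelihood of the binned data under Geometric(p)
    return -(counts * geom.logpmf(values, p)).sum()

res = minimize_scalar(nll, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)   # recovers roughly 0.2
```

This gives the same answer as running MLE on the raw samples, at a tiny fraction of the cost, so the binned representation loses nothing for discrete data.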

Not sure why the lognormal fit looks like that.

Fixed typo (model3 to model4) and removed the location and scale parameters from GammaDistribution. Looks much better now.

Upload samples.dat somewhere.

ufile.io/8xcui

I ran the stochastic process up to a maximum of 99 steps; any sample of the process that continued past that is labeled as -1.

Inverse Gaussian distribution looks like a pretty tight fit. The plot thickens.

Have you tried a Landau distribution?

Doesn't the Landau distribution have support on negative values, tho?

no idea. It looks like a photopeak efficiency curve for a scintillator that I've used before.

You could also try a*exp(b*x^2 + c*x + d*x^-1 + f*x^-2 + ...)

Here's what FindDistribution suggests in case you haven't already tried that.

Let me say though that the number of runs that went past 99 seems excessive to simply delete them. As you can see at the bottom here, the best distribution Mathematica could find assigns only a tiny tiny fraction of the probability mass to values beyond 99, whereas the actual data suggests that as much as a quarter of the mass should be found there. I don't know if there's some way to constrain the search to account for this but it's something to think about. Maybe try looking for some heavy-tailed distributions.

Huh, you're right. The fitting process would naively think that values > 99 have zero probability. Not sure how to constrain it to account for that. MLE on the range [0..99]?

I guess I could go back to fitting the PDF of each distribution to the points we do have values for.

Try replacing all the '-1's with '100's and fitting the same distributions with this applied to them:
reference.wolfram.com/language/ref/CensoredDistribution.html
with xmin = 0, xmax = 100

Shouldn't it be TruncatedDistribution?

That's what you'd want to use if you just ignored all the -1s.
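The censored likelihood is also easy to write by hand: observed stopping times contribute log pmf(t), and each run that passed 99 steps contributes log P(T > 99) via the survival function. Sketch with a geometric stand-in, using -1 to mark censored runs as in the thread:

```python
# Sketch: right-censored MLE. Runs that passed 99 steps (marked -1)
# contribute the survival probability P(T > 99) instead of a pmf value.
# Geometric stand-in for whatever actually generated samples.dat.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import geom

rng = np.random.default_rng(1)
raw = rng.geometric(0.02, size=100_000)
times = np.where(raw <= 99, raw, -1)             # -1 marks censored runs

obs = times[times != -1]
n_cens = (times == -1).sum()

def nll(p):
    return -(geom.logpmf(obs, p).sum() + n_cens * geom.logsf(99, p))

res = minimize_scalar(nll, bounds=(1e-6, 0.5), method="bounded")
print(res.x)   # recovers roughly 0.02
```

Unlike truncation, this uses the information that a quarter of the runs went long, so the fitted tail actually carries that mass.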

I see. This is what it looks like now.

I forgot to mention that there is a nonzero probability that the stochastic process never stops, hence some of the samples don't stop at *any* time. Any ideas on how I can handle such a semimeasure/"deficient" distribution?

Something just occurred to me: Maybe a mixture distribution with one sub-distribution being one of the parametrized distributions and the other being a delta at 100.
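That mixture idea can be sketched directly as a two-parameter MLE: a weight w for the never-stopping mass (lumped into the censored bucket) plus (1-w) times a parametric stopping-time law. Geometric stand-in again; both parameters are hypothetical illustrations, not the thread's actual process:

```python
# Sketch: mixture of a "never stops" point mass (weight w, landing in the
# censored bucket) and a Geometric(p) stopping time, fit jointly by MLE.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import geom

rng = np.random.default_rng(2)
n = 100_000
never = rng.random(n) < 0.1                      # true never-stop prob 0.1
raw = rng.geometric(0.05, size=n)
times = np.where(~never & (raw <= 99), raw, -1)  # -1 marks censored runs

obs = times[times != -1]
n_cens = (times == -1).sum()

def nll(theta):
    w, p = theta
    # observed runs: (1 - w) * pmf(t); censored runs: w + (1 - w) * P(T > 99)
    return -(np.log1p(-w) * len(obs) + geom.logpmf(obs, p).sum()
             + n_cens * np.log(w + (1 - w) * geom.sf(99, p)))

res = minimize(nll, x0=[0.3, 0.02], bounds=[(1e-6, 0.9), (1e-6, 0.5)])
print(res.x)   # recovers roughly [0.1, 0.05]
```

The two components are identifiable here because the parametric tail beyond 99 is small; if the fitted distribution itself were very heavy-tailed, w and the tail mass would trade off against each other.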

Any chance you could upload something more broadly compatible? I was going to play around with this in R until I noticed it was a Mathematica file format.

readBin('~/Downloads/samples.dat', "int", n=1e5, size=8)

this works fine for me
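For anyone poking at it in Python instead, the equivalent read (assuming the file is raw 64-bit integers, as size=8 in that readBin call suggests) is np.fromfile. Demonstrated on a throwaway file rather than the actual download:

```python
# Python equivalent of the R readBin call: read raw 64-bit ints.
# A throwaway file stands in for samples.dat here.
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.gettempdir(), "samples_demo.dat")
data = np.array([3, 7, -1, 42], dtype=np.int64)
data.tofile(path)                                # stand-in for samples.dat

samples = np.fromfile(path, dtype=np.int64)
print(samples)
```

If the values look garbled, try the other byte order (`dtype=">i8"` vs `"<i8"`); readBin and np.fromfile both default to the machine's native endianness.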

ah, thanks, I didn't know about that function