Short testing of resurrection rates

akots · November 8, 2017, 10:02am

I made a habit of recording resurrection events for The Dragon Soul and Infernal King in PvP battles over the past couple of weeks. So, my total is 75 TDS/IK encounters. The number is probably insufficient to draw reliable conclusions, but I just got bored and 3 weeks is quite a long time. For statistical calculations, I used the Real Statistics add-in for Excel 2010. For simulation, I used Excel, which has the same Mersenne-Twister pRNG as GoW.

So, I compared actual results and simulation to see whether they come from some uniform random distribution.
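For anyone who wants to reproduce the simulated side without Excel, here is a minimal Python sketch. Python's `random` module is also a Mersenne Twister, but the function name, the seed, and the model of one independent 25% draw per encounter are my assumptions for illustration, not the spreadsheet that was actually used.

```python
import random

def simulate_resurrections(n_encounters=75, p=0.25, seed=None):
    """Return a list of 0/1 outcomes, where 1 means the troop resurrected.

    Assumption: one independent draw per encounter at probability p.
    Double or triple resurrections are not modelled here.
    """
    rng = random.Random(seed)  # CPython's random module uses Mersenne Twister
    return [1 if rng.random() < p else 0 for _ in range(n_encounters)]

outcomes = simulate_resurrections(seed=2017)
print(sum(outcomes), "resurrections in", len(outcomes), "encounters")
```

Changing the seed gives a different sequence each time, which is a quick way to see how much the count wanders for only 75 encounters.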

Actual data had 24 resurrections, including 5 double resurrections. I did not have triple or higher resurrections. So, the percentage is 32% (24 of 75), which might be fairly close to the expected 25%.

Simulated data had 20 resurrections. Percentage is 26.7% which is a bit closer to expected 25%.
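A rough way to check how close these percentages are to 25% is a one-sample z-test for a proportion. The sketch below is only an illustration: it treats every encounter as a single independent trial (so it ignores double resurrections), and the function name is made up.

```python
import math

def proportion_z_test(successes, n, p0=0.25):
    """Normal-approximation test of an observed proportion against p0.

    Returns (z, two_sided_p). Assumes n independent trials.
    """
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the normal
    return z, p_value

for label, count in (("actual", 24), ("simulated", 20)):
    z, p = proportion_z_test(count, 75)
    print(f"{label}: z = {z:.2f}, two-sided P = {p:.3f}")
```

Under these assumptions the actual count of 24 out of 75 sits about 1.4 standard errors above 25%, so the rate by itself is not significantly off at the 0.05 level.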

I then used the Runs test (some people call it the Wald–Wolfowitz runs test; details are in "Wald–Wolfowitz runs test" on Wikipedia and in "One Sample Runs Test" on Real Statistics Using Excel) to check whether the data come from a random sample.

So, the actual gathered data had a z-stat value of 1.99 and a P value of 0.047. The simulated data had a much lower z-stat value of 0.1 and a P value of 0.92. For those who cannot be bothered to read the math and understand what it means and whether it is applicable: a P value less than 0.05 means that the hypothesis that the runs are random is rejected at the 5% significance level, so the actual data do not look random, while the simulated data are fully consistent with randomness. It seems that the test might not be powerful enough with only 75 data points. A Chi-squared test with a sufficient power of around 0.8 might require something around 400 points. But the difference between the actual data and the simulated data is quite staggering, so I presume that, just based on this, the sample size is sufficient.
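For reference, the runs test with the usual normal approximation can be sketched in a few lines of Python. This is not the Real Statistics implementation (which may differ in details such as a continuity correction), and the example sequence is made up, not the recorded data.

```python
import math

def runs_test(seq):
    """Wald-Wolfowitz runs test for a binary sequence.

    Returns (runs, z, two_sided_p) using the normal approximation.
    """
    flags = [bool(x) for x in seq]
    n1 = sum(flags)            # number of successes
    n2 = len(flags) - n1       # number of failures
    if n1 == 0 or n2 == 0:
        raise ValueError("both outcomes must occur at least once")
    # A new run starts every time the outcome changes.
    runs = 1 + sum(1 for a, b in zip(flags, flags[1:]) if a != b)
    n = n1 + n2
    expected_runs = 2 * n1 * n2 / n + 1
    variance = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n * n * (n - 1))
    if variance <= 0:
        raise ValueError("sequence too short for the normal approximation")
    z = (runs - expected_runs) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return runs, z, p_value

example = [1, 1, 0, 0, 1, 1, 0, 0]  # made-up sequence for illustration
print(runs_test(example))
```

A negative z means fewer runs than expected (outcomes clump into streaks), and a positive z means more runs than expected (outcomes alternate too often).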

I’m not sure if it means anything. Mostly because the runs for actual data are not sequential and the simulated data runs are sequential. However, since the actual data are derived from a larger set of apparently sequential random data, the actual data might be random although not necessarily.

I can post the numbers if somebody wants to take a look. And yes, I have some professional background in statistics.

Strictly speaking, the only real thing that can be concluded is that my simulation is not adequate. However, it is random according to runs test and is obtained using the same pRNG as GoW.

Ask some questions and I’ll try to answer to the best of my abilities. I’m also running some other tests, and they are not black and white; statistics is never black and white, apart from a few clear cases.

Grundulum · November 8, 2017, 11:59am

I suspect that you’re right about 75 examples not being nearly enough.

When I was doing my tribute tracking, I spent a month below average — 8,000+ samples of supposedly independent tribute chances, which wound up being more than three standard deviations below the 26% I was expecting. The next month of samples was a few standard deviations above 26%, so that I had an 8,000-point data set at -3 sigma and an adjacent 8,000-point set at +3 sigma. The average across two months and some 17,000 data points (pinging @Saltypatra, who said she’d keep an eye on my thread) was almost exactly the expectation value.
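To put a number on what three standard deviations means here, this is a small sketch of the binomial arithmetic. It assumes exactly 8,000 independent trials at 26%, which rounds the "8,000+" figure above.

```python
import math

def binomial_sigma(n, p):
    """Standard deviation of the success count for n independent trials."""
    return math.sqrt(n * p * (1 - p))

n, p = 8000, 0.26  # assumed round figures
sigma = binomial_sigma(n, p)
low = 100 * (p - 3 * sigma / n)
high = 100 * (p + 3 * sigma / n)
print(f"expected successes: {n * p:.0f}")
print(f"one sigma: {sigma:.1f} successes ({100 * sigma / n:.2f} percentage points)")
print(f"3-sigma band for the observed rate: {low:.2f}% to {high:.2f}%")
```

So a month at -3 sigma corresponds to an observed rate only about one and a half percentage points below 26%, which is small in absolute terms but very unlikely for independent trials.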

However, the pendulum swung far away from expectation before correcting almost perfectly. I wish I knew a statistical test for streakiness, because I really think the game is prone to streaks even if the (very) long term numbers work out correctly.

akots · November 8, 2017, 1:10pm

Your purpose was different: you wanted to know the average. I wanted to know the distribution, to see whether it is random. Well, it is not, and idk what the reason is. With real-life data and statistics, 75 samples of binary data are more than enough to draw a conclusion about randomness, as you can see from the results of the simulation. The National Institute of Standards and Technology actually even has some guidelines about RNG testing (Random Bit Generation | CSRC). People use them a lot in cryptography. A sample of 100 binary data points is considered quite sufficient to estimate randomness. Larger samples may be either periodic or deceptively random because of the Law of large numbers.
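One way to see how the runs test behaves at this sample size is to feed it many sequences that are random and independent by construction and count how often it rejects at 0.05. The sketch below is a hypothetical Monte Carlo check; the function names, the number of trials, and the seed are arbitrary choices.

```python
import math
import random

def runs_test_p(seq):
    """Two-sided P value of the runs test (normal approximation)."""
    n1 = sum(seq)
    n2 = len(seq) - n1
    if n1 == 0 or n2 == 0:
        return 1.0  # test not applicable; count as "not rejected"
    runs = 1 + sum(1 for a, b in zip(seq, seq[1:]) if a != b)
    n = n1 + n2
    expected_runs = 2 * n1 * n2 / n + 1
    variance = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n * n * (n - 1))
    if variance <= 0:
        return 1.0
    z = (runs - expected_runs) / math.sqrt(variance)
    return math.erfc(abs(z) / math.sqrt(2))

def rejection_rate(n=75, p=0.25, trials=5000, alpha=0.05, seed=0):
    """Fraction of independent random sequences that fail the runs test."""
    rng = random.Random(seed)
    rejected = 0
    for _ in range(trials):
        seq = [1 if rng.random() < p else 0 for _ in range(n)]
        if runs_test_p(seq) < alpha:
            rejected += 1
    return rejected / trials

rate = rejection_rate()
print(f"share of truly random 75-point sequences rejected at 0.05: {rate:.3f}")
```

By design of the test this share should land somewhere near 5%, so a single P value just under 0.05 is something a genuinely random source also produces from time to time.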

Using a high volume of data should normally be redundant, and it leads to substantial bias in evaluating distributions. I would again like to remind you about the Law of large numbers, which is one of the critical theorems in statistics (Law of large numbers - Wikipedia).

There is another important theorem called the Central limit theorem (Central limit theorem - Wikipedia). According to CLT:

CLT is also applicable to multidimensional variables and sum of discrete random variables that is still a discrete random variable. So, in general, in most situations, as long as variables are random and independent, distribution should be close to normal. If distribution deviates substantially from normal, this means that the variables are either non-random or not independent.
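A quick numerical illustration of the CLT for this kind of data: sum many independent 25% draws, standardize the sums, and check how many fall inside the usual normal band. The sample sizes and the seed are arbitrary assumptions.

```python
import math
import random

def standardized_sums(n=100, p=0.25, trials=10000, seed=0):
    """Standardized success counts from repeated batches of n Bernoulli(p) draws."""
    rng = random.Random(seed)
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    results = []
    for _ in range(trials):
        successes = sum(1 for _ in range(n) if rng.random() < p)
        results.append((successes - mu) / sigma)
    return results

zs = standardized_sums()
within = sum(1 for z in zs if abs(z) <= 1.96) / len(zs)
print(f"fraction within +/-1.96 sigma: {within:.3f} (normal theory: about 0.95)")
```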

With regard to streaks, strictly speaking, some type of cluster analysis might be required. I’m looking into various options. However, the runs test is itself an analysis of streaks (runs) in a binary sequence, which is exactly the case here. The number of runs in a random binary sequence should be approximately normally distributed. This is exactly the type of test that shows that the binary sequence in this case is not random.
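Before any formal test, it can help to simply tabulate the streaks. Here is a minimal sketch using `itertools.groupby`; the example sequence is invented and the summary fields are just one possible choice.

```python
from collections import Counter
from itertools import groupby

def run_lengths(seq):
    """Collapse a sequence into (value, length) pairs, one per run."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(seq)]

def streak_summary(seq):
    """Count runs and describe the streaks of successes (truthy values)."""
    runs = run_lengths(seq)
    success_runs = [length for value, length in runs if value]
    return {
        "total_runs": len(runs),
        "longest_success_streak": max(success_runs, default=0),
        "success_streak_counts": dict(Counter(success_runs)),
    }

example = [0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0]  # made-up data
print(streak_summary(example))
```

The `total_runs` value is the same quantity the runs test works with, so the table of streak lengths shows where an unusual run count comes from.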

BTW, if you have your data on tribute, I’d like to try and run some tests with it to see how random it is. Also, it is of sufficient size to check various options for cluster analysis. The only thing needed is an uninterrupted flow of data in the exact sequence in which the numbers were generated, with no missing points, to ensure continuity. So, every time this particular number was generated, it should be recorded. If there are too many missing numbers, cluster analysis won’t work. But if you know how many numbers are missing and when exactly, that would work.

Grundulum · November 8, 2017, 1:31pm

Speaking of streaks, I’m quite irritated that I lost to your team in Guild Wars yesterday. Seemed like every time I made a match, either a skull match or a 4+ match dropped for your team.

—————

I did not track tribute counts individually. I recorded the totals every 30-90 collections. During this time, I only failed to record two results. It wouldn’t affect tests of the average, but it might affect streakiness. I’ll see if I still have the data, and I’ll PM it to you if the clumpy format isn’t an issue for running statistical tests.
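Even batched totals allow a rough streakiness check: turn each batch into a z-score against its binomial expectation and look at how spread out the z-scores are. The sketch below assumes 29 independent trials per collection at 26%, and the batch numbers are invented placeholders, not the recorded totals.

```python
import math

def batch_z_scores(batches, trials_per_collection=29, p=0.26):
    """batches is a list of (collections, tributes_received) pairs."""
    scores = []
    for collections, tributes in batches:
        n = collections * trials_per_collection
        expected = n * p
        sigma = math.sqrt(n * p * (1 - p))
        scores.append((tributes - expected) / sigma)
    return scores

def dispersion_statistic(z_scores):
    """Sum of squared z-scores; roughly chi-squared with len(z_scores)
    degrees of freedom if the batches are independent and p is right."""
    return sum(z * z for z in z_scores)

# Hypothetical batches: (number of collections, kingdoms that paid tribute).
batches = [(30, 215), (60, 470), (90, 665), (45, 350)]
zs = batch_z_scores(batches)
print([round(z, 2) for z in zs])
print("dispersion:", round(dispersion_statistic(zs), 2), "vs about", len(zs), "expected")
```

A dispersion statistic far above the number of batches would point to batches swinging more than independent trials should, although batching hides any streaks shorter than one batch.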

akots · November 8, 2017, 1:54pm

Sorry about GW, that defense team has been thoroughly tested by me and guild mates to ensure maximal annoyance for red days. Spirit fox removes yellow, Apothecary usually targets red and green spawned by Forest troll is useless for the opponent on a red day even if it misfires.

For testing of streaks, I need a sequence of raw data, namely how many kingdoms gave you tribute each time; this is the most important part. To run the simulation, I need to know the exact chance for each kingdom at the time of tribute generation. This may be interesting and super-multidimensional. The size of the tribute itself is not very helpful, as it is not random and most likely won’t allow tracing to exact kingdoms since there are too many variables. I can then compare the distribution of streaks in the simulation and the actual data and see which specific tests work well. If there are indeed 8000 points, it might be possible to do some cluster analysis to pinpoint the problem, if there is any.

Grundulum · November 8, 2017, 2:35pm

Like I said, I didn’t record the sequence of tributes. I just wrote down the totals every 30-90 collections (each of which represented 29 theoretically independent Bernoulli trials at 26% success chance). I’m no longer a good person to do tribute tracking, unfortunately. I have 28 kingdoms at 20% chance of tribute, and one each at 10% and 30%. Sure, that still averages to 20%, but I am unsure what effects that would have on any sort of streakiness, as well as how the +6% from the red statue plays with kingdoms at 10/20/30% tribute chances.
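On the question of unequal chances: for independent kingdoms the expected number of tributes is the sum of the chances and the variance is the sum of p(1 - p), so the two setups can be compared directly. This sketch uses the chances as listed above; treating the +6% as a flat addition to every kingdom is purely an assumption.

```python
def tribute_moments(chances):
    """Mean and variance of the tribute count for independent kingdoms."""
    mean = sum(chances)
    variance = sum(p * (1 - p) for p in chances)
    return mean, variance

mixed = [0.20] * 28 + [0.10, 0.30]   # chances as listed in the post
equal = [0.20] * len(mixed)          # same average, all kingdoms identical

# Assumption: the statue bonus adds a flat 6 percentage points per kingdom.
mixed_with_bonus = [p + 0.06 for p in mixed]

for label, chances in (("mixed", mixed), ("equal", equal), ("mixed + 6%", mixed_with_bonus)):
    mean, variance = tribute_moments(chances)
    print(f"{label}: mean = {mean:.2f}, variance = {variance:.2f}")
```

With the same average, the mixed setup has a slightly smaller variance than the equal one, so a couple of odd kingdoms should barely change how streaky independent draws look.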

akots · November 12, 2017, 10:48pm

I have slightly expanded the test from 75 to 100 samples to see if the trend continues. And yes, it does continue.

I want to emphasize again that the purpose of the test is NOT to confirm that resurrection rate is 25% or whatever. The purpose of the test is to determine whether resurrection is random and independent. Here are updated results.

Actual data has 35 resurrections including 5 double and 1 triple. Total percentage is 35% which is within acceptable error range of assumed 25% at P=0.05 for this sample size.

Simulated data had 28 resurrections, percentage is 28%.

However, as I wrote previously, according to the Runs test, actual data have z-stat of 2.05 and P value of 0.041. While simulated data have z-stat 0.420 and P value is 0.675.

This means that, at the 5% significance level, the hypothesis that the actual results are random and independent is rejected. This is a very firm conclusion, and I am completely sure that the sample size is absolutely enough. Also, it is highly unlikely that the data points are not independent, since the data were obtained through multiple sessions and even encountering a fully traited Dragon Soul should be an independent event. Otherwise I have to presume some conspiracy in which facing a TDS that resurrected (or not) changes your chance of encountering TDS, which is obviously not true.

So, the only reasonable conclusion is that the resurrection event is not randomly determined by the game. The evidence of the Runs test is sufficient to make this statement at the 5% significance level. And the developers might want to explain that.

The Runs test is a moderately powerful tool. If some sequence of events passes the Runs test, it does not mean that the events are actually random and independent. But if it fails the test, it certainly means that it is not random and independent. There is quite a lot of literature about that, and the test is extensively used in determining the quality of pRNGs. And the pRNG itself in its pure form passes the test just fine. That means that GoW has something that alters the results of the pRNG. So far, I am not aware of anything that the developers have told us about non-randomness of resurrection events.