To explain what is the Hyperball algorithm, I will re-use a piece of e-mail I sent to a friend some time ago. I don't feel guilty for doing that, but this post is really a bit heavy. So read it with caution.
To best explain the algorithm, one needs to focus on a particular problem. Let's just do that.
Imagine you want to measure the average temperature of a meteorite when it falls on the ground. Temperature is a scalar field: it is a real quantity and, in our example, it can be thought to be a function of several variables we can determine along with it (the data we collect of several falls of stones to the ground): longitude, latitude, but also, to some degree, barometric pressure, the medium you are measuring (whether water or sand or ground or ice or air), etcetera. You collect data from all over the world, and you get a string of numbers per each "event" (a fall, measured with your standard meteorite-temperaturer): latitude, longitude, height above sea level, barometric pressure, medium on which the measurement was made, temperature of the specimen.
So, what is the average temperature of fallen rocks ? Well, if you had no other information than the last bit in the string above, you would just have to make the average of the measurements of temperature. A bit better, though, is to plot all temperatures on a histogram, and fit the histogram with a gaussian function, extracting the mean and its error from the fit. Why a gaussian curve ? Well, because that is the typical behavior of random fluctuations in a measurement such as the one we're making.
Now, the gaussian will be wide. Not because of the differences in temperature of the specimens, but because of the various biases each measurement was subjected to: in the water, most measurements will be lower. In the desert, you'll measure a higher temperature. At 12000 feet low temps, at very low barometric pressure low temps also (probably it's raining). You get the idea. Your measurements suffer from imprecisions due
to systematic effects, which are much more important than the true statistical spread of the temperature measurement due to the inherent accuracy of your measuring instrument.
Ok, so you'd very much like to correct for these biases. Enter the Monte Carlo. It is a simulation of rocks falling through the atmosphere, which takes into account - as well as you can model it - the climate of the earth as a function of longitude and latitude, the lakes and seas, the pressure, etcetera. With a Monte Carlo simulation you can obtain a billion "events", which span different measuring conditions resembling those of the data you have (the real falls).
What one typically would do with this billion events would be to study the correlation between - say - the latitude and the temperature mismeasurement (since in a simulated rock fall you know the exact temperature and you also know the measured one - through a simulation of the measuring device). You would then have a correction function to apply
to your measurements in the real rock falls, say for instance
Temperature_true = Temperature_measured + 5.5 x cosine(latitude)
which tells you that at the equator the temperature measurement is fine, while at north and south pole you have to add 5.5 degrees.
The problem with this simple approach is that by extracting single-variable correction functions such as the one above, one completely forgets about the hidden intercorrelation among the biases. So, for instance, if my Monte Carlo shows that at the south pole I have to add 5.5 degrees (since, from the formula above, 5.5 is bias_1(latitude=-90 degrees)), and if it also says I have to add 1 degree per every 500 meters of altitude above sea level:
Temperature_true = Temperature_measured + 1 x altitude/(500 m)
I would add 5.5 degrees plus 4 degrees (bias_2(altitude=2000m)) for falls at the south pole, which has an average altitude of 2000 meters above sea level: but that is wrong, since in
determining my single-variable functions bias_1(lat) and bias_2(altitude)
I did not take into account their mutual correlation: bias_1 is biased by altitude, and bias_1 is biased by latitude! I must not add 7.5 degrees to a measurement made at the south pole: maybe 6 degrees are enough, since most of the variation with altitude was already incorporated in my function bias_1(lat) when I determined it from the MC.
So, this only worsens if you have 18 variables instead of four. The problem has to be solved by an algorithm.
Enter the Hyperball(C) algorithm. What it does is to consider this billion Monte Carlo rock falls as a billion measurements of the temperature mismeasurement (D=T(true)-T(meas), remember? For MC events we KNOW the true temperature) in the hyperspace of the 5 variables latitude, longitude, pressure, altitude, medium. That is,
D=D(lat,long,pres,alt,med).
The scalar field we care about is now not the temperature, but the temperature mismeasurement D. If we know that exactly, bingo! For a real rock fall, we will just correct the temperature by using the five additional measurements of lat, long, pressure, altitude and medium: they will give us D=T(true)-T(meas) for that rock fall, so we will ADD to our
T(meas) the scalar field D, and we will get a measurement of T which is only limited by the intrinsic resolution of our measuring device, free of biases!
But... a scalar field is a continuous thing. We have a billion MC events, but still, they just sample D, they don't DETERMINE it fully.
How, then, to use the MC events at our best ? For each real rock fall, we determine its lat, long, alt, pres, med, and go to that particular point in the 5-dim hyperspace. There, we collect the closest MC events, and we average the scalar field D using them. Picking the closest events means inflating a hyperball around the chosen point of space, capturing
MC events inside it. These MC events will have the same biases of the real rock fall, or nearly so. Thus, their average D will be a good measurement of the true scalar field D in the point determined by the real values of the five vars measured for the real rock fall.
Now, that is very simple. However, since we have LIMITED MC statistics, we need to be smarter. With infinite MC stats D would be known with whatever precision one needs, but with limited MC, we need to pick the MC events which are more relevant for our average.
You might have noticed that I did not say what form the hyperball had. I just said I picked the closest events. But wait a minute. What does it mean, closest ? Is the water of the Hudson river closer to the water of the Potomac, or to the ground of Central park ? We have a problem! We have very different variables, and we need to be careful how we
define a distance, a generalized one to be sure. Our distance will need to be of this form:
d^2 = k1(lat1-lat2)^2 + k2(long1-long2)^2 + k3(pres1-pres2)^2 +
k4(alt1-alt2)^2 + k5(med1-med2)^2.
That is, we need to rely on a WEIGHT VECTOR k=(k1,k2,k3,k4,k5) which gives the proper weight to each coordinate of the space. Imagine, for instance, that the last coordinate was not the medium, but the day of the week: 1 through 7 for monday through sunday. Irrelevant, right ? We cannot assume that to obtain our best measurement of the bias D we should ever average events occurred on the same day of the week, disregarding if they have happened at the same latitude, for instance. We need to DE-WEIGHT that coordinate. The hyperball will become a hypercylinder if we set k5=0, because the day of the week will weigh zero in determining the closest events to the one for which we wish to determine the bias from MC.
So, we need to determine the proper weight of the coordinates, that is the weight vector. Therefore, the problem has become one of minimizing the width of our corrected temperature distribution, T'=T(meas)+D, by varying the coordinates k1,k2,k3,k4,k5.
This minimization can be performed by standard means, but it is very time consuming! However, there is a work-around that provides excellent results and does not require normous CPU power. I am currently working at an implementation of the algorithm which includes this new idea, and I am getting better results than the ones shown in the previous post, with a hundredth of the CPU time investment!
I am very excited by all this. I hope I will be able to show some more concrete results here soon.