A Note on Root-Finding
Have you ever accidentally advanced the state of the art in a field that's not your specialty? I'd guess it's more common than we expect, since people in this situation don't generally know how to publicize their findings. I'm going to relate my own small example, in hopes someone who needs it can find it.
This being not my field of specialty, I go to Wikipedia to see what a good root-finding algorithm is. There are several, but they tend to have the same set of tradeoffs: they can get faster convergence sometimes, but in the worst cases they are worse, often far worse, than the bisection method. In fact, it's been proven that the bisection method is the best you can do in the worst case, hard stop. But then I see something that sounds promising: the ITP method by Oliveira and Takahashi, which purports to have the same worst-case performance as bisection, but with better average performance. Great, let's try that!
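For reference, here's a minimal sketch of the baseline everything gets compared against - plain bisection (the function and parameter names are my own):

```python
def bisect(f, a, b, eps=1e-10):
    """Find a root of f in [a, b], assuming f(a) and f(b) have opposite signs."""
    fa, fb = f(a), f(b)
    assert fa * fb < 0, "f(a) and f(b) must bracket a root"
    while b - a > 2 * eps:
        m = 0.5 * (a + b)      # always pick the midpoint
        fm = f(m)
        if fm == 0:
            return m
        if fa * fm < 0:        # root is in [a, m]
            b, fb = m, fm
        else:                  # root is in [m, b]
            a, fa = m, fm
    return 0.5 * (a + b)
```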
I implement this ITP method, and it does do better than bisection, but not by all that much. It has several tuning parameters, but they don't make much sense and changing them doesn't seem to help. Its behavior from iteration to iteration is also very slapdash - sometimes lots of improvement, sometimes almost none. And the wiki page offers no intuitive explanation of why the algorithm is built the way it is. Time to look at the paper.
Ugh, it's behind a paywall; I eventually find a free version - it's 29 pages long! So much for "brevity is the soul of wit." Also, it's from 2021? I sort of thought root-finding would be a pretty solved problem by now (the other algorithms are mostly from the 1970s), but I suppose this might explain the poor state of the wiki page at least.
It takes some digging, but the paper does in fact explain things much better. ITP stands for "Interpolate, Truncate, and Project", which I don't think explains it very well. I would call it IPT for "Interpolate, Perturb, and Truncate", because what it actually does is calculate the interpolated crossing (regula falsi in math speak), push that point a little toward the middle, then bound that point within a window around the middle. The last step is the clever part - the window shrinks according to the size of the corresponding bisection step, which gives them the guarantee of performance no worse than the bisection method.
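Here's how one iteration looks in my reading of it - a sketch, not the paper's exact formulas; in particular the perturbation size below is a simplified stand-in for their rule, and the window radius r comes from the bisection budget explained next:

```python
def interpolate_perturb_truncate(a, b, fa, fb, r):
    """One iteration in the ITP spirit: interpolate, perturb, truncate.
    r is the allowed deviation from the midpoint (see the budget sketch below).
    The perturbation size here is a simplified stand-in for the paper's rule."""
    mid = 0.5 * (a + b)
    # 1. Interpolate: the regula falsi estimate of the crossing.
    x = (a * fb - b * fa) / (fb - fa)
    # 2. Perturb: nudge the estimate toward the midpoint.
    delta = 0.1 * (b - a)              # simplified stand-in for the paper's delta rule
    if abs(mid - x) <= delta:
        x = mid
    else:
        x = x + delta if x < mid else x - delta
    # 3. Truncate: clamp to within r of the midpoint - this is the step
    #    that preserves the bisection worst-case guarantee.
    return min(max(x, mid - r), mid + r)
```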
How does that work? Basically, they take advantage of the fact that when you're aiming for some finite tolerance, bisection halves the interval each time, so on the last iteration you usually end up a little better than you needed - the interval is unlikely to line up exactly with the tolerance. So they use that extra as wiggle-room at the beginning to try to pick a better estimate than the midpoint. If it works, they have even more wiggle room; if not, the window closes and you're back to the guaranteed bisection path of picking the midpoint.
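To put rough numbers on that (my notation, following my reading of the paper): bisection needs ceil(log2((b - a) / (2 * eps))) iterations, and because that count is rounded up, the final interval usually overshoots the tolerance a bit - that overshoot is the wiggle room.

```python
import math

def bisection_budget(a, b, eps, n0=1):
    """Worst-case iteration count and the initial wiggle room, in my notation.
    n_half bisections shrink [a, b] below 2*eps; because it's rounded up,
    eps * 2**n_half is a bit more than (b - a) / 2, and that excess is what
    gets spent on non-midpoint guesses.  n0 is the extra iteration the
    authors recommend adding to enlarge that room."""
    n_half = math.ceil(math.log2((b - a) / (2 * eps)))
    n_max = n_half + n0
    slack = eps * 2 ** n_max - 0.5 * (b - a)   # initial radius of the window around the midpoint
    return n_half, n_max, slack
```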
Okay, so what's up with step 2: pushing the interpolated point toward the middle? It turns out the paper says you can do this any way you want - they have a fairly complicated rule that supposedly gives fast convergence, but in practice I'm not impressed. So intuitively, why wouldn't we just choose our best estimate of where the root is (regula falsi) and use that directly?
The reason is that the real goal of iterative root-finding is to reduce the distance between your two bounding points. Let's say your root is 1/10th of the way from the below point to the above point. Your function isn't perfectly linear, so your interpolated guess isn't perfect either: say a 50% chance of landing slightly above or slightly below zero. If you're above, great! You reduced your distance to 1/10th of what it was - much better than the 1/2 of the bisection method. If you're below though, you've only reduced it to 9/10ths of what it was, which is quite a bit worse than bisection. See an example in the figure:
Red: the function, Blue: interpolation. Below are three options for the next interval: Black: bisection, Purple: regula falsi, Green: perturbed toward the middle. Shorter is better!
If this were really a coin flip each time, that might be okay, but it's not. Consider why your guess was slightly below: for well-behaved functions that are reasonably smooth and continuous, it means the function had positive curvature (a slight smile). On the next iteration your points haven't moved much, the curvature is the same, so your next guess will also err below, again picking the side that doesn't reduce the interval much. This repeats, never moving the far point any closer.
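You can watch that trap in action with pure regula falsi on a convex toy function (my own example, not from the paper) - the far endpoint barely moves:

```python
def regula_falsi_steps(f, a, b, n):
    """Run n pure regula falsi steps and print how slowly the interval shrinks."""
    fa, fb = f(a), f(b)
    for _ in range(n):
        x = (a * fb - b * fa) / (fb - fa)   # interpolated crossing
        fx = f(x)
        if fa * fx < 0:
            b, fb = x, fx
        else:
            a, fa = x, fx
        print(f"interval = [{a:.6f}, {b:.6f}], width = {b - a:.6f}")

# f(x) = x**2 - 0.01 has its root at 0.1; the guesses keep landing below it,
# so the left endpoint creeps up while the right endpoint stays pinned at 1.
regula_falsi_steps(lambda x: x * x - 0.01, 0.0, 1.0, 5)
```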
This is why the authors recommend shifting the guess toward the middle - the idea is to bias enough so you pick the side that will actually shorten the interval. Their method did that, but their bias (as a fraction of the current interval) decreased quickly. This meant it worked well for a few iterations, but then often got caught in this trap as the bias got so small it couldn't overcome any curvature anymore. Then it would get stuck recalculating nearly the same point over and over until the bisection bound kicked in again and it went back to using the midpoint.
The funny thing is they show exactly this behavior in the plot of their algorithm's progression as compared to others - I think they intended it as an example of their worst-case performance bound, but in my testing I ran into this worst-case behavior about half the time! So I decided to try something much simpler: just pick a weighted average of the interpolated point and the midpoint. The weight is the single parameter: more toward the interpolated point gives better median performance, but the worst-case situation also happens more often. Too close to the middle and the median becomes the worst case.
I found that moving 1/10th of the way toward the midpoint gives good average performance on the functions I tested. In fact, I was seeing about 1/3 of the iterations the bisection method required, and about 1/2 of what I was getting with the ITP from the paper, all while maintaining the same worst case as bisection! Actually, we add one iteration to that worst case, as the authors recommend, to give more wiggle-room at the beginning - but what's one iteration between friends? Especially now that this worst case is relatively rare.
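Put together, the whole thing is short. Here's a sketch of the variant described above (my own naming and structure - see the linked implementation for the real thing): blend the regula falsi point with the midpoint, then clamp it to the shrinking window that preserves the bisection bound. A weight of 0.9 toward the interpolated point corresponds to moving 1/10th of the way toward the midpoint.

```python
import math

def blended_root_find(f, a, b, eps=1e-10, w=0.9, n0=1):
    """Root-finding sketch: weighted average of the regula falsi point and the
    midpoint (weight w toward regula falsi), clamped to a window around the
    midpoint that shrinks with the remaining bisection budget.  This is my
    reading of the approach described above, not the published implementation."""
    fa, fb = f(a), f(b)
    assert fa * fb < 0, "f(a) and f(b) must bracket a root"
    n_max = math.ceil(math.log2((b - a) / (2 * eps))) + n0   # worst-case iteration budget
    k = 0
    while b - a > 2 * eps:
        mid = 0.5 * (a + b)
        x = w * (a * fb - b * fa) / (fb - fa) + (1 - w) * mid   # blended estimate
        r = eps * 2 ** (n_max - k) - 0.5 * (b - a)              # allowed deviation from the midpoint
        x = min(max(x, mid - r), mid + r)                       # clamp: keeps the bisection guarantee
        fx = f(x)
        if fx == 0:
            return x
        if fa * fx < 0:    # root is in [a, x]
            b, fb = x, fx
        else:              # root is in [x, b]
            a, fa = x, fx
        k += 1
    return 0.5 * (a + b)

# e.g. blended_root_find(lambda x: x**3 - 2, 0.0, 2.0) is approximately 2**(1/3)
```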
Did I really just come up with a new state-of-the-art root-finding algorithm from debugging a Wikipedia article? Kudos to the authors for coming up with a useful guarantee that holds for a whole class of root-finding algorithms. You can find my open-source implementation here (also conveniently unitless and optimized for n-dimensional domain), and hopefully you can further improve upon it.