September 22, 1999
A Faster Way of Shuffling the Data Cards
By STEVEN E. BRIER
Al-go-rithm: 1. a repetitive procedure for solving a problem or accomplishing
some end, especially by a computer; 2. the Veg-O-Matic of the data-mining world; it
slices, it dices and it's faster than a speeding processor.
At its simplest, an algorithm is a set of instructions for solving a problem. But
at its best, an algorithm is a method of looking at information in ways that are not
obvious, one that answers old questions better and faster, or answers questions you
didn't know you had.
"An algorithm is the teeny little engine that makes sense of masses of
data," said Christopher Meyer, director of the Ernst & Young Center for
Business Innovation in Cambridge, Mass.
Without algorithms, Meyer said, much of E-commerce would not be possible. Faster
performance and unexpected results are what drive E-businesses, he said.
Online businesses, and their marketing departments, collect mammoth amounts of data
about their customers. What they looked at, what they bought, where they came from, where
they went, where they live and dozens of other little pieces of information are tracked.
Companies want to analyze that information and use it to tune their pricing, product
lineup and welcoming screens.
"If you have to have a market analyst looking at every person and seeing
what to market to them, it would not be economically feasible," Meyer said. "But
if you can mine the data, you can make a business."
Mining those data requires more than fast processors, said William Pulleyblank,
director of Deep Computing at I.B.M.'s Thomas J. Watson Research Center. A well-designed
algorithm can deliver performance gains that would take years of hardware advances to
match. Pulleyblank uses two methods of sorting 1,000 checks into numerical order to
illustrate the improvement that a clever algorithm can provide.
Version one requires rifling through the checks looking for the check with the lowest
number and pulling it to the front, then going back and looking for the check with the
next-lowest number and putting it behind the first, and so on. This would work, but you
could end up comparing the numbers on more than half a million pairs of checks: effective,
but time consuming.
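In a modern programming language, that first method might look like the following minimal sketch (the function name and sample data are illustrative, not from the article):

```python
def sort_checks_version_one(checks):
    """Repeatedly scan the whole pile for the lowest-numbered check
    and move it to the front of the sorted portion.

    For n checks this makes roughly n * (n - 1) / 2 comparisons:
    about half a million when n is 1,000.
    """
    checks = list(checks)  # work on a copy, leave the original pile alone
    for i in range(len(checks)):
        lowest = i
        # rifle through the remaining checks for the lowest number
        for j in range(i + 1, len(checks)):
            if checks[j] < checks[lowest]:
                lowest = j
        # pull that check to the front of the unsorted portion
        checks[i], checks[lowest] = checks[lowest], checks[i]
    return checks
```

This is the pattern known as selection sort; its cost grows with the square of the number of checks.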
Now try something different. Instead of sorting one big stack, divide the pile of
checks in half. Then divide one of those halves in half again, sort each part and merge
them. Do the same with the other half, then merge the two sorted halves into one
ordered pile.
Confusing? Maybe, but you just sorted the checks using only about 20,000 comparisons
instead of half a million. And the savings in time grow dramatically with the size of the
pile.
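The divide-and-merge approach above can be sketched the same way (again, the names are illustrative):

```python
def sort_checks_version_two(checks):
    """Divide the pile in half, sort each half the same way, then
    merge the two sorted halves into one ordered pile.

    For n checks this makes on the order of n * log2(n) comparisons,
    a tiny fraction of the half million needed by version one.
    """
    if len(checks) <= 1:
        return list(checks)  # a pile of one check is already sorted
    mid = len(checks) // 2
    left = sort_checks_version_two(checks[:mid])
    right = sort_checks_version_two(checks[mid:])
    # merge: repeatedly take whichever front check is lower
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])   # one of these two is already empty
    merged.extend(right[j:])
    return merged
```

This is the classic merge sort. The trick is that merging two already-sorted piles is cheap: you only ever compare the two front checks.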
That's an algorithm in action.