
The Olympics of AI: Benchmarking Machine Learning Systems

Matthew Stewart, PhD | Towards Data Science

You can't improve what you don't measure. — Peter Drucker

For years, running a mile in under four minutes was considered not just a daunting challenge but, by many, an impossible feat. It was a psychological and physical benchmark that many thought was unattainable. Doctors and sports experts theorized that the human body was not capable of running that fast for that long. This belief was so ingrained that some even suggested attempting to do so could be fatal.

Sir Roger Bannister, a British middle-distance runner and medical student, thought differently. While he recognized the challenge, he believed that the barrier was more psychological than physiological. Bannister took a scientific approach to his training, breaking the mile into sections and rigorously timing each one. He also followed a demanding regimen based on interval training and set smaller benchmarks for himself in the lead-up to his record attempt.

On May 6, 1954, at a track in Oxford, England, with his friends Chris Brasher and Chris Chataway serving as pacemakers, Bannister made his attempt to break the four-minute barrier. He completed the mile in 3 minutes 59.4 seconds, shattering the threshold and making history.

The aftermath of Bannister's achievement was highly unexpected. Gunder Hägg's 1945 record of 4 minutes 1.4 seconds had stood for almost a decade before Bannister came along. Yet once the four-minute mile benchmark was broken, others soon followed. Just 46 days after Bannister's run, John Landy finished a mile in 3 minutes 57.9 seconds. Over the next ten years, the record was beaten another five times. The current record, set by Hicham El Guerrouj, stands at 3 minutes 43.1 seconds.

Bannister's achievement illustrates the power of benchmarks, not just as measures of performance but as motivators for change. Once the four-minute "benchmark" was broken, it redefined what athletes believed was possible. The barrier was as much in the mind as it was on the track.

The four-minute mile embodies the transformative power of benchmarks across disciplines. Benchmarks provide a way to quantify performance improvements on particular tasks, giving us a way to compare ourselves to others. This is the entire basis for sporting events such as the Olympics. However, benchmarks are only useful if the community involved can agree on a common goal to pursue.

In the realm of machine learning and computer science, benchmarks serve as the communal Olympics — a grand arena where algorithms, systems, and methodologies compete, not for medals, but for the pride of advancement and the drive for innovation. Just as athletes train for years to shave milliseconds off their time in pursuit of Olympic gold, developers and researchers optimize their models and systems, striving to outperform on established benchmarks.

The art and science of benchmarking lie in the establishment of that common goal. It is not merely about setting a task, but about ensuring the task captures the essence of real-world challenges, pushing the boundaries of what is possible while remaining relevant and applicable. Poorly chosen benchmarks can lead researchers astray, optimizing for tasks that do not translate into improvements in real-world applications.
A well-designed benchmark, by contrast, can guide a whole community toward breakthroughs that redefine a field. Hence, while benchmarks are tools for comparison and competition, their true value lies in their ability to unite a community around a shared vision. Much like Bannister's run didn't just break a record but redefined athletic potential, a well-conceived benchmark can elevate an entire discipline, shifting paradigms and ushering in new eras of innovation.

In this article, we will explore the crucial role of benchmarking in advancing computer science and machine learning by journeying through its history, discussing the latest trends in benchmarking machine learning systems, and seeing how benchmarking spurs innovation in the hardware sector.

In the 1980s, as the personal computer revolution was taking off, there was a growing need for standardized metrics to compare the performance of different computer systems: a benchmark. Before standardized benchmarks existed, manufacturers often developed and used their own custom benchmarks, which tended to highlight their machines' strengths while downplaying their weaknesses. It became clear that a neutral, universally accepted benchmark was necessary for fair comparison.

To address this challenge, the System Performance Evaluation Cooperative (SPEC) was formed. Its members were hardware vendors, researchers, and other stakeholders interested in creating a universal standard for benchmarking central processing units (CPUs), also commonly referred to as "chips."

SPEC's first major contribution was the SPEC89 benchmark suite, groundbreaking as one of the first attempts at an industry-standard CPU benchmark. SPEC's benchmarks focused on real-world applications and computing tasks, aiming to provide metrics that mattered to end users rather than esoteric or niche measurements.

However, as the benchmark evolved, an intriguing phenomenon emerged: the so-called "benchmark effect." As the SPEC benchmarks became the gold standard for measuring CPU performance, CPU designers started optimizing their designs for SPEC's benchmarks. In essence, because the industry had come to value SPEC benchmarks as a measure of overall performance, there was a strong incentive for manufacturers to ensure their CPUs performed exceptionally well on these tests — even if it meant potentially sacrificing performance on non-SPEC tasks.

This wasn't necessarily SPEC's intention, and it led to a spirited debate within the computer science community. Were the benchmarks genuinely representative of real-world performance? Or were they driving a form of tunnel vision, where the benchmarks became an end unto themselves rather than a means to an end?

Recognizing these challenges, SPEC continually updated its benchmarks over the years to stay ahead of the curve and prevent undue optimization. Its benchmark suites expanded to cover different domains, from integer and floating-point computation to more domain-specific tasks in graphics, file systems, and more.

The story of SPEC and its benchmarks underscores the profound impact that benchmarking can have on an entire industry's direction. The benchmarks didn't merely measure performance — they influenced it. It's a testament to the power of standardization, but also a cautionary tale about the unintended consequences that can emerge when a single metric becomes the focal point of optimization.
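To make the idea of a "single metric" concrete: suites in the SPEC CPU tradition typically normalize each test's runtime against a fixed reference machine and report the geometric mean of those ratios as the headline score. The sketch below illustrates that general recipe with made-up runtimes and a hypothetical reference machine; it is not SPEC's official tooling.

```python
from statistics import geometric_mean

# Hypothetical per-test runtimes in seconds (lower is better).
# The "reference" values stand in for a fixed baseline machine.
reference_times = {"compress": 500.0, "compile": 800.0, "fft": 300.0}
measured_times = {"compress": 125.0, "compile": 320.0, "fft": 60.0}

def suite_score(reference: dict, measured: dict) -> float:
    """Geometric mean of per-test speedups relative to the reference machine."""
    ratios = [reference[name] / measured[name] for name in reference]
    return geometric_mean(ratios)

print(f"Suite score: {suite_score(reference_times, measured_times):.2f}x the reference")
# Per-test speedups are 4.0, 2.5, and 5.0, so the aggregate is roughly 3.68x.
```

Because the geometric mean multiplies ratios rather than adding them, a huge win on any single test moves the overall score only modestly, which is one reason suite designers favor it over a simple average.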
Today, SPEC benchmarks, along with many others, continue to play a vital role in shaping the computer hardware industry and in guiding consumers and enterprises in their purchasing decisions.

In the late 2000s, computer vision, a subfield of AI focused on enabling machines to interpret and make decisions based on visual data, was struggling to make progress. Traditional techniques had advanced, but they were hitting a performance plateau on many tasks. The methods available at the time relied heavily on hand-crafted features, requiring experts to meticulously design and select specific features for each task. It was a tedious process with many limitations.

Then came ImageNet, a massive visual database initiated by Dr. Fei-Fei Li and her team. ImageNet provided millions of labeled images spanning thousands of categories. The sheer volume of the dataset was unprecedented, enabled only by the ability to crowdsource data labeling through platforms such as Amazon Mechanical Turk. ImageNet was one of the first dataset benchmarks — since its release, the ImageNet paper has been cited over 50,000 times.

But collecting the dataset was just the beginning. In 2010, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was launched. The challenge was simple in its objective but daunting in its scale: automatically classify an image into one of 1,000 categories. This benchmark would provide an objective measure of progress in computer vision, on a scale far beyond previous attempts.

The initial years saw incremental improvements over traditional methods. The 2012 challenge, however, witnessed a transformative shift. A team from the University of Toronto, led by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, introduced a deep convolutional neural network (CNN) called "AlexNet." Their model achieved a top-5 error rate of 15.3%, slashing the previous year's error by nearly half.

What made this possible? Deep learning, and particularly CNNs, could learn features directly from raw pixels, eliminating the need for manual feature crafting. Given enough data and computational power, these networks could uncover intricate patterns far beyond what traditional methods could manage.

The success of AlexNet was a watershed moment in the development of AI. The years following 2012 saw deep learning methods dominating the ImageNet challenge, driving error rates lower and lower. The message from the benchmarks was undeniable: deep learning, once a niche area of machine learning, was set to revolutionize computer vision.

And it did more than that. The success in the ILSVRC acted as a catalyst, propelling deep learning to the forefront not just of computer vision but of numerous areas in AI, from natural language processing to game playing. The challenge underscored the potential of deep learning, attracting researchers, funding, and focus to the area.

By setting a clear, challenging benchmark, the ImageNet challenge played a pivotal role in redirecting the trajectory of AI research, leading to the deep learning-driven AI renaissance we witness today.
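ILSVRC results such as AlexNet's 15.3% are reported as top-5 error: a prediction counts as correct if the true label appears among the model's five highest-scoring classes. The snippet below is a minimal, self-contained sketch of that metric using synthetic scores rather than a real model or the actual ImageNet data.

```python
import numpy as np

def top_k_error(scores: np.ndarray, labels: np.ndarray, k: int = 5) -> float:
    """Fraction of examples whose true label is NOT among the k highest-scoring classes."""
    # Indices of the k largest scores per row (order within the top k doesn't matter).
    top_k = np.argpartition(scores, -k, axis=1)[:, -k:]
    hits = (top_k == labels[:, None]).any(axis=1)
    return 1.0 - hits.mean()

# Synthetic "logits" for 10,000 images over 1,000 classes, plus random true labels.
rng = np.random.default_rng(0)
scores = rng.normal(size=(10_000, 1_000))
labels = rng.integers(0, 1_000, size=10_000)

print(f"Top-1 error: {top_k_error(scores, labels, k=1):.3f}")
print(f"Top-5 error: {top_k_error(scores, labels, k=5):.3f}")
```

With random scores the top-5 error hovers around 99.5% (1 - 5/1000), which puts AlexNet's 15.3%, and today's single-digit error rates, in perspective.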
The transformative impact of benchmarks like SPEC and ImageNet naturally prompts the question: what's next? As deep learning models became increasingly complex, so did their computational demands. This shifted attention to another critical component — the hardware that powered these models. Enter MLPerf.

MLPerf emerged as a collaborative effort involving industry giants and academic institutions, with the mission of creating a standard set of benchmarks to measure the performance of machine learning hardware, software, and cloud platforms. As the name suggests, MLPerf focuses explicitly on machine learning, capturing a broad spectrum of tasks ranging from image classification to reinforcement learning. The objective was to bring clarity to a field where "best performance" claims were becoming commonplace yet were often based on inconsistent criteria or cherry-picked metrics.

The introduction of MLPerf gave the tech industry a much-needed unified yardstick. For academia, it provided a clear performance target, fostering an environment where innovation in algorithms could be easily measured and compared. For industry, especially hardware manufacturers, it posed both a challenge and an opportunity. No longer could a new chip be launched with vague assertions about its machine learning performance; there was now a universally accepted benchmark that would put any such claims to the test.

And just as SPEC influenced CPU design, MLPerf began shaping the direction of AI hardware. Companies started optimizing their designs with MLPerf benchmarks in mind, and it was not just about raw performance. The benchmarks also incorporated efficiency metrics, encouraging innovations that delivered not just speed but also energy efficiency, a pressing concern in the age of colossal transformer models and environmental consciousness. These benchmarks are routinely used by big tech companies, such as Nvidia and AMD, to showcase their new hardware. Today, MLCommons manages dozens of MLPerf-style benchmarks covering a wide range of tasks and deployment settings.

But MLPerf isn't without its critics. As with any benchmark that gains prominence, there are concerns about "overfitting" to benchmarks, where designs excessively optimize for the benchmark tests at the potential cost of real-world applicability. Moreover, there's the ever-present challenge of keeping benchmarks relevant, updating them to reflect the rapid advancements in the ML field.

Still, the story of MLPerf, much like those of its predecessors, underscores a fundamental truth: benchmarks catalyze progress. They don't just measure the state of the art; they shape it. By setting clear, challenging targets, they focus collective energies, driving industries and research communities to break new ground. And in a world where AI continues to redefine what is possible, having a compass to navigate its complexities becomes not just desirable but essential.
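In practice, MLPerf-style inference results boil down to measurements such as throughput (samples per second) and tail latency under a standardized load generator. The sketch below is not the official MLPerf harness; it simply shows, with a stand-in model function and simplified assumptions, the kind of timing measurement these benchmarks formalize.

```python
import time
import statistics

def run_model(batch):
    """Stand-in for a real model's forward pass; replace with your own inference call."""
    time.sleep(0.002)  # pretend inference takes ~2 ms per batch
    return [0] * len(batch)

def measure(batch_size: int = 32, num_batches: int = 200):
    """Time repeated batches and report throughput plus 99th-percentile latency."""
    latencies = []
    for _ in range(num_batches):
        batch = [None] * batch_size
        start = time.perf_counter()
        run_model(batch)
        latencies.append(time.perf_counter() - start)
    throughput = batch_size * num_batches / sum(latencies)
    p99 = statistics.quantiles(latencies, n=100)[98]  # 99th-percentile batch latency
    return throughput, p99

tput, p99 = measure()
print(f"Throughput: {tput:,.0f} samples/sec, p99 latency: {p99 * 1000:.2f} ms")
```

Real MLPerf submissions additionally fix the model, dataset, quality target, and query patterns so that numbers reported by different vendors are actually comparable.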
Beyond AI hardware, large language models, a form of generative AI, are a key focus of benchmarking efforts. These models, often referred to more generally as foundation models, are harder to benchmark than hardware or even most other types of machine learning models.

This is because the success of a language model doesn't hinge solely on raw computational speed or accuracy on narrowly defined tasks. Instead, it rests on the model's ability to generate coherent, contextually relevant, and informative responses across a wide variety of prompts and contexts. Furthermore, evaluating the "quality" of a response is inherently subjective and can vary with the application or the biases of the evaluator. Given these complexities, benchmarks for language models like GPT-3 or BERT must be more diverse and multifaceted than traditional benchmarks.

One of the best-known benchmarks for language models is the General Language Understanding Evaluation (GLUE) benchmark, introduced in 2018. GLUE wasn't a single task; it was a collection of nine diverse language tasks, ranging from sentiment analysis to textual entailment. The idea was to provide a comprehensive evaluation, ensuring models were not just excelling at one task but were genuinely capable of understanding language across a variety of challenges.

The impact of GLUE was immediate and profound. For the first time, there was a clear, consistent benchmark against which language models could be evaluated. Soon, tech giants and academia alike were participating, each vying for the top spot on the GLUE leaderboard.

When GPT-2 was first evaluated against the GLUE benchmark, it secured a then-astounding score that surpassed many models. This wasn't just a testament to GPT-2's prowess; it underscored the value of GLUE in providing a clear measuring stick. The ability to claim "state-of-the-art on GLUE" became a coveted recognition in the community.

However, GLUE's success was a double-edged sword. By late 2019, many models had begun to saturate the GLUE leaderboard, with scores nearing the human baseline. This saturation highlighted another critical aspect of benchmarking: the need for benchmarks to evolve with the field. To address this, the same team introduced SuperGLUE, a tougher benchmark designed to push the boundaries further.
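Mechanically, evaluating on a GLUE task comes down to producing a label for each example in an evaluation split and scoring it with that task's metric. As a minimal sketch, the snippet below loads one GLUE task (SST-2 sentiment) via the Hugging Face datasets library and scores a trivial majority-class baseline; a real submission would of course substitute a trained model's predictions.

```python
from collections import Counter
from datasets import load_dataset  # pip install datasets

# SST-2: binary sentiment classification, one of GLUE's nine tasks.
sst2 = load_dataset("glue", "sst2")
train_labels = sst2["train"]["label"]
val = sst2["validation"]

# Trivial baseline: always predict the most common training label.
majority_label = Counter(train_labels).most_common(1)[0][0]
predictions = [majority_label] * len(val)

accuracy = sum(p == y for p, y in zip(predictions, val["label"])) / len(val)
print(f"SST-2 validation accuracy (majority baseline): {accuracy:.3f}")
```

GLUE's headline number averages the per-task metrics (accuracy, F1, or correlation coefficients, depending on the task), and the official test labels are hidden, so leaderboard scores come from submitting predictions rather than evaluating locally.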
Benchmarks like GLUE, SuperGLUE, and SQuAD evaluate models on specific tasks such as sentiment analysis and question answering. But these benchmarks only scratch the surface of what foundation models aim to achieve, and additional dimensions beyond task-specific accuracy have emerged for assessing them.

An interesting development is that, as the performance of language models approaches human performance, tests historically used to assess humans are now being used as benchmarks for language models. For instance, GPT-4 was tested on exams like the SAT, the LSAT, and medical board exams. On the SAT, it scored 1410, ranking in the top 6% nationally. GPT-4 was even able to pass all versions of the medical board exams, with a mean score of 80.7%. On the LSAT, however, it scored lower, with 148 and 157, placing it in the 37th and 70th percentiles.

It will be interesting to see how benchmarking approaches continue to develop for language models as they begin to rival and exceed human performance in many areas.

The future of benchmarking is evolving rapidly, diversifying to address a broad spectrum of emerging technologies and applications. The computer science and machine learning communities are well aware of the importance of benchmarking for driving progress in their fields. NeurIPS, one of the flagship AI conferences, now has a track dedicated solely to datasets and benchmarks. In its third year, the track is gaining immense momentum, reflected in the staggering number of close to 1,000 submissions this year alone. As technology continues its relentless march, benchmarks will continue to guide and shape its trajectory in real time, as they have done before.

The role of benchmarks in shaping progress, whether in athletics or AI, cannot be overstated. They act both as mirrors, reflecting the current state of affairs, and as windows, offering a glimpse into future potential. As AI continues to influence diverse applications and industries, from healthcare to finance, robust benchmarks become crucial. They ensure that progress is not just rapid but meaningful, steering efforts toward challenges that matter. As Sir Roger Bannister showed us with his four-minute mile, sometimes the most daunting benchmarks, once conquered, can unleash waves of innovation and inspiration for years to come. In the world of machine learning and computing, the race is far from over.


