AI has evolved: It's time for better evaluations and report cards

The new Llama 4 model, launched by Meta earlier this month, supposedly has "unrivaled speed and efficiency." But does that actually make it the best AI available? Not necessarily.

AI models often get passing grades on standard benchmarks and leaderboards, but still make embarrassing mistakes when used in the real world.

    For example, Google's Gemini should have "outperformed human experts," according to several benchmarks. Yet, the AI created offensive historical images anyway. 

    Other leading AI models have lied about real people, advised businesses to break the law and excluded certain groups of people from getting jobs. It's a sign that our methods for evaluating the effectiveness of AI don't translate to real-world applications and outcomes. 

    We need report cards that evaluate AI more holistically.

    Today, AI models are evaluated based on benchmarks that measure how well the models perform certain tasks. Popular metrics include how accurately a model can answer questions, how accurately it determines the next logical outcome in a series of events, how well it writes code and understands images, and whether people like the AI's content.
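    To make that concrete, a task-level benchmark score often reduces to something like the minimal sketch below: exact-match accuracy over a small question-answering set. The function name and the sample questions are hypothetical, not any specific leaderboard's code, but they show how a single number comes to stand in for "how good the AI is."

```python
# Minimal sketch of a task-based benchmark score: exact-match accuracy on a
# hypothetical question-answering set (illustrative data, not a real benchmark).

def exact_match_accuracy(predictions, references):
    """Fraction of predictions that match the reference answer after
    light normalization (lowercase, collapsed whitespace)."""
    normalize = lambda s: " ".join(s.lower().split())
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return correct / len(references)

# Hypothetical model outputs vs. gold answers. Note that "four" vs. "4" counts
# as wrong here, a small example of how narrow metrics can mislead.
preds = ["Paris", "four", "The Great Gatsby"]
golds = ["paris", "4", "The Great Gatsby"]
print(f"Exact-match accuracy: {exact_match_accuracy(preds, golds):.2f}")  # 0.67
```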

    Several of these metrics are relics of a time when AI was less capable and less widespread in society, and they are no longer accurate measures of an AI's ability to perform the tasks asked of it today. 

    While autonomous vehicles need to correctly identify objects with their onboard sensors, that alone isn't enough to avoid collisions. They also need to understand how traffic will move and how to quickly respond to an erratic, unsafe driver. 

    An AI designed to help with difficult medical diagnoses or hiring decisions shouldn't adopt the racial, socioeconomic and ethnic biases inherent in training datasets. But if an AI gets passing grades on benchmarks that don't evaluate performance in these areas, the model might appear more capable than it truly is. 

    The consequences could be troubling in emerging, high-stakes applications.

    To really understand how well an AI performs, we need to develop outcome-focused benchmarks that are more relevant for today's application areas. Want to make a chatbot that can give legal or business advice? We need to know whether its recommendations will be legal, feasible and result in the desired outcome. 

    We also need to know about the AI's other costs. Are its benefits worth the growing water and energy costs, and are we comfortable with all of the long-term implications of its use?

    How can we get more effective benchmarks? 

    For starters, AI developers need to learn what makes their products effective from the perspective of their stakeholders and customers. With that knowledge, companies and benchmark developers can ensure their data can adequately evaluate AI. 

    We will also need to learn how to better determine, detect and minimize the harmful, unintended consequences of AI before it's launched to the public. Collaborations between AI developers and medical, business and law professionals will be essential to real success.

    Benchmark developers also need to ensure the data used for benchmarking isn't also used to train AI, which could make a faulty AI pass as effective because it's been evaluated with familiar data. 

    Ideally, benchmark and AI developers would coordinate to ensure that training and benchmarking datasets don't overlap. When this is difficult or impossible, benchmark developers need to keep their datasets fresh to reduce the chances that benchmarking data has accidentally been absorbed into training data. 
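    As a rough illustration of what such a check might look like, the sketch below flags benchmark items that share long word n-grams with a training corpus. The datasets, function names, and n-gram length are all hypothetical choices made for the example; real deduplication pipelines are more sophisticated, but the underlying idea of detecting train-test overlap is the same.

```python
# Minimal sketch of a contamination check between a benchmark and a training
# corpus, assuming both are available as plain text. It flags benchmark items
# whose long n-grams also appear in the training data.

def ngrams(text, n=13):
    """Set of word n-grams in a lowercased text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_contaminated(benchmark_items, training_docs, n=13):
    """Return benchmark items sharing at least one n-gram with training data."""
    train_ngrams = set()
    for doc in training_docs:
        train_ngrams |= ngrams(doc, n)
    return [item for item in benchmark_items if ngrams(item, n) & train_ngrams]

# Hypothetical data: one benchmark question was scraped into the training set.
train = ["... the quick brown fox jumps over the lazy dog near the river bank today ..."]
bench = [
    "the quick brown fox jumps over the lazy dog near the river bank today",
    "an unrelated, genuinely unseen question about traffic safety",
]
print(flag_contaminated(bench, train, n=8))  # flags only the first item
```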

    Effective benchmarks will require constant maintenance; the risk of overlapping with training data grows as benchmarks grow older.

    Policymakers can also ensure we get effective AI benchmarks by putting the right incentives in place. Certifications that require transparent, outcome-focused benchmarks verified by human experts could help companies prove that their AI is effective. Regulations could also enforce best practices, such as testing the public versions of AI models rather than optimized offshoots and ensuring that benchmarks are transparent and relevant. 

    But for regulations to be effective, we will need new, dedicated agencies to develop and enforce the standards, since AI doesn't neatly fall into the jurisdictions of existing agencies.

    AI is a powerful tool, with the potential to do a lot of good and a lot of harm. Functional benchmarks and benchmarking policies could help us push AI in the right direction.

    Lu Wang is an associate professor of Computer Science and Engineering at the University of Michigan. David Jurgens is an associate professor in the University of Michigan School of Information and an associate professor of Electrical Engineering and Computer Science in the University of Michigan College of Engineering. They are coauthors of “Evaluation Framework for AI Systems in ‘the Wild.’”
