How law-abiding is AI? ETH researchers put it to the test

The EU AI Act is designed to ensure that AI is transparent and trustworthy. For the first time, ETH computer scientists have translated the Act into measurable technical requirements for AI. In doing so, they have shown how well today's AI models already comply with the legal requirements.

A man looks at a box containing a large number of neural networks (the so-called black box of AI)
More transparency about how AI achieves its results. This is one of the goals of the EU AI Act. Computer scientists at ETH are now paving the way for how this requirement might technically be expressed and evaluated. (Image: AI-generated)

In brief

  • Researchers at ETH Zurich, together with partners, have provided the first comprehensive technical interpretation of the EU AI Act for General Purpose AI (GPAI) models.
  • They present the first "compliance checker", a set of benchmarks that can be used to assess how well different AI models comply with the likely requirements of the EU AI Act.
  • They applied their benchmarking to 12 prominent large language models (LLMs). In doing so, they identified some fundamental challenges regarding the technical implementation of AI legislation.

Researchers from ETH Zurich, the Bulgarian AI research institute INSAIT – created in partnership with ETH and EPFL – and the ETH spin-off LatticeFlow AI have provided the first comprehensive technical interpretation of the EU AI Act for General Purpose AI (GPAI) models. This makes them the first to translate the legal requirements that the EU places on future AI models into concrete, measurable and verifiable technical requirements.

Such a translation is highly relevant for the further implementation of the EU AI Act: the researchers present a practical way for model developers to see how well aligned they are with future EU legal requirements. No such translation from high-level regulatory requirements down to runnable benchmarks has existed so far, so it can serve as an important reference point both for model training and for the EU AI Act Code of Practice, which is currently being developed.

The researchers tested their approach on twelve popular generative AI models such as ChatGPT, Llama, Claude or Mistral. After all, these large language models (LLMs) have contributed enormously to the growing popularity and spread of artificial intelligence (AI) in everyday life, as they are very capable and intuitive to use. With the increasing spread of these and other AI models, the ethical and legal requirements for the responsible use of AI are also increasing: for example, sensitive questions arise regarding data protection, privacy and the transparency of AI models. Models should not be "black boxes" but rather deliver results that are as explainable and traceable as possible.

Implementation of the AI Act must be technically clear

Furthermore, AI models should function fairly and not discriminate against anyone. Against this backdrop, the EU AI Act, which the EU adopted in March 2024, is the world's first AI legislative package that comprehensively seeks to maximise public trust in these technologies and minimise their undesirable risks and side effects.

"The EU AI Act is an important step towards developing responsible and trustworthy AI," says ETH computer science professor Martin Vechev, head of the Laboratory for Safe, Reliable and Intelligent Systems and founder of INSAIT, "but so far we lack a clear and precise technical interpretation of the high-level legal requirements from the EU AI Act. This makes it difficult both to develop legally compliant AI models and to assess the extent to which these models actually comply with the legislation."

The EU AI Act sets out a clear legal framework to contain the risks of so-called General Purpose Artificial Intelligence (GPAI). This refers to AI models that are capable of executing a wide range of tasks. However, the Act does not specify how the broad legal requirements are to be interpreted technically. The technical standards are still being developed ahead of the regulations for high-risk AI models coming into force in August 2026.

"However, the success of the AI Act's implementation will largely depend on how well it succeeds in developing concrete, precise technical requirements and compliance-centered benchmarks for AI models," says Petar Tsankov, CEO and, with Vechev, a founder of the ETH spin-off LatticeFlow AI, which deals with the implementation of trustworthy AI in practice. "If there is no standard interpretation of exactly what key terms such as safety, explainability or traceability mean in (GP)AI models, then it remains unclear for model developers whether their AI models run in compliance with the AI Act," adds Robin Staab, Computer Scientist and doctoral student in Vechev's research group.

Test of twelve language models reveals shortcomings

The methodology developed by the ETH researchers offers a starting point and basis for discussion. The researchers have also developed a first "compliance checker", a set of benchmarks that can be used to assess how well AI models comply with the likely requirements of the EU AI Act.

In view of the ongoing concretisation of the legal requirements in Europe, the ETH researchers have made their findings publicly available in a study. They have also made their results available to the EU AI Office, which plays a key role in the implementation of, and compliance with, the AI Act, and thus also in model evaluation.

Overview of the structure of the new COMPL-AI benchmarking suite. Starting from the six ethical principles of the EU AI Act (left), the researchers extract corresponding technical requirements (middle) and connect those to state-of-the-art LLM benchmarks (right). (Figure: Lattice Flow, SRI Lab, INSAIT)

In a study that is largely comprehensible even to non-experts, the researchers first clarify the key terms. Starting from six central ethical principles specified in the EU AI Act (human agency, data protection, transparency, diversity, non-discrimination, fairness), they derive 12 associated, technically clear requirements and link these to 27 state-of-the-art evaluation benchmarks. Importantly, they also point out the areas in which concrete technical checks for AI models are less well developed or even non-existent, encouraging researchers, model providers and regulators alike to push these areas further for an effective implementation of the EU AI Act.
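To make the three-level structure concrete, the mapping from principles to requirements to benchmarks can be pictured as a nested lookup table whose benchmark scores are rolled up level by level. The following is only a minimal sketch and not the COMPL-AI code: the requirement and benchmark names are placeholders, the scores are made-up values rather than results from the study, and plain averaging is an assumed aggregation rule.

```python
from statistics import mean

# Hypothetical mapping: principle -> technical requirement -> benchmarks.
# All names are placeholders for illustration, not the suite's actual contents.
FRAMEWORK = {
    "transparency": {
        "explainability": ["benchmark_a"],
        "traceability": ["benchmark_b", "benchmark_c"],
    },
    "fairness": {
        "absence_of_bias": ["benchmark_d", "benchmark_e"],
    },
}


def aggregate(framework, scores):
    """Roll benchmark scores (0 to 1) up to requirements and principles.

    Averaging is an assumption made for this sketch. Benchmarks without
    a score are skipped; a requirement or principle for which nothing
    could be measured is reported as None rather than as 0.
    """
    report = {}
    for principle, requirements in framework.items():
        per_requirement = {}
        for requirement, benchmarks in requirements.items():
            available = [scores[b] for b in benchmarks if b in scores]
            if any(not 0.0 <= s <= 1.0 for s in available):
                raise ValueError(f"score outside [0, 1] for {requirement}")
            per_requirement[requirement] = mean(available) if available else None
        measured = [s for s in per_requirement.values() if s is not None]
        report[principle] = {
            "score": mean(measured) if measured else None,
            "requirements": per_requirement,
        }
    return report


# Made-up placeholder scores for one hypothetical model (benchmark_e unmeasured).
example_scores = {
    "benchmark_a": 0.5,
    "benchmark_b": 1.0,
    "benchmark_c": 0.5,
    "benchmark_d": 0.25,
}
report = aggregate(FRAMEWORK, example_scores)
```

Reporting an unmeasurable requirement as None instead of 0 keeps "not yet measurable" apart from "not met", which mirrors the gaps in available technical checks that the researchers point out.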

Impetus for further improvement

The researchers applied their benchmark approach to 12 prominent large language models (LLMs). The results make it clear that none of the language models analysed fully meets the requirements of the EU AI Act today. "Our comparison of these large language models reveals that there are shortcomings, particularly with regard to requirements such as robustness, diversity, and fairness," says Robin Staab. This also has to do with the fact that, in recent years, model developers and researchers have focused primarily on general model capabilities and performance rather than on ethical or social requirements such as fairness or non-discrimination.

However, the researchers have found that even key AI concepts such as explainability are unclear. In practice, there is a lack of suitable tools for subsequently explaining how the results of a complex AI model came about: What is not entirely clear conceptually is also almost impossible to evaluate technically. The study makes it clear that various technical requirements, including those relating to copyright infringement, cannot currently be reliably measured. For Robin Staab, one thing is clear: "Focusing the model evaluation on capabilities alone is not enough."

That said, the researchers' sights are set on more than just evaluating existing models. For them, the EU AI Act is a first case of how legislation will change the development and evaluation of AI models in the future. "We see our work as an impetus to enable the implementation of the AI Act and to obtain practicable recommendations for model providers," says Martin Vechev, "but our methodology can go beyond the EU AI Act, as it is also adaptable for other, comparable legislation."

"Ultimately, we want to encourage a balanced development of LLMs that takes into account both technical aspects such as capability and ethical aspects such as fairness and inclusion," adds Petar Tsankov. The researchers are making their benchmark tool COMPL-AI available on a GitHub website to initiate the technical discussion. The results and methods of their benchmarking can be analysed and visualised there. "We have published our benchmark suite as open source so that other researchers from industry and the scientific community can participate," says Petar Tsankov.

The results of five models on the new COMPL-AI benchmark suite, grouped per ethical principle. The benchmarking works with values between 0 and 1: a value of 1 indicates that the requirements of the AI Act are fully complied with; at 0, they are not met at all. (Table: Lattice Flow, SRI Lab, INSAIT)

Reference

Guldimann, P, Spiridonov, A, Staab, R, Jovanović, N, Vero, M, Vechev, V, Gueorguieva, A, Balunović, M, Konstantinov, N, Bielik, P, Tsankov, P, Vechev, M. COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act. In: arXiv:2410.07959 [cs.CL]. DOI: https://doi.org/10.48550/arXiv.2410.07959
