

Beyond Explainability: The Ethical Imperative for Validated and Interpretable Clinical AI

Two doctors examine MRI brain scans. Photo by Vitaly Gariev on Unsplash.


Evan Hackstadt ‘27


Evan Hackstadt is a computer science major with minors in biology and math. He is a 2025-26 health care ethics intern at the Markkula Center for Applied Ethics at Santa Clara University. Views are his own.

 

Artificial Intelligence (AI) is becoming increasingly embedded in health care. Complex machine learning models such as deep neural networks have achieved impressive accuracy at clinical support tasks, from disease detection to personalized treatment planning. However, the black box nature of these models raises a number of concerns, chief among them: is it ethical to use a model for patient care when no one can explain its decisions? Explainability techniques have been deployed in response to this issue, but they have flaws of their own that threaten the duties of clinicians and the rights of patients. What level of interpretability should we require of clinical AI? What techniques and regulations are needed to protect patients while still advancing care?

The Problem With Black Box Models in Health Care

Deep neural networks use many parameters and layers of nodes to learn complex patterns in data. They can be highly accurate, but are known to come with a number of risks: encoding systemic bias, using problematic shortcuts to make predictions, and struggling to generalize to real clinical settings. Additionally, these models are “black box” in nature: there is no way to know how the model arrived at a prediction, because its structure is too complex.

Black box clinical AI violates a patient's right to autonomy. For example, if a clinician uses a black box model to justify a treatment, they will be unable to explain how the model arrived at its prediction. Fully informed consent therefore cannot be obtained from the patient, violating autonomy and eroding trust.

Furthermore, bias is difficult to detect in black box models. If a clinical support model is biased, it broadly violates the ethical principle of justice.

The consequences of error-prone black box models in health care have already surfaced in practice, notably with the proprietary Epic Sepsis Model. An external validation study found that the model (at that time) performed worse than the company claimed, created alert fatigue through false positives, and took shortcuts: it predicted sepsis largely based on signs that the clinician already suspected sepsis (e.g., the ordering of a diagnostic test). Because the model was a black box, there was no way to explain its predictions, allowing the circular logic it had learned to go undetected.
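This kind of circularity can be made concrete with a toy simulation. Everything below is synthetic and purely illustrative (made-up base rates, a deliberately naive "model"), but it shows how a predictor keyed to a clinician's own actions can score well on paper while adding no information the clinician did not already have:

```python
import random

# Synthetic illustration of shortcut learning / circular logic.
# All rates and data here are invented for demonstration.
random.seed(0)

patients = []
for _ in range(1000):
    septic = random.random() < 0.10           # ~10% true sepsis rate
    # Clinicians who already suspect sepsis usually order a culture:
    culture_ordered = septic and random.random() < 0.90
    patients.append((culture_ordered, septic))

# "Model": predict sepsis iff a culture was ordered -- pure circularity.
correct = sum(ordered == septic for ordered, septic in patients)
accuracy = correct / len(patients)
print(f"accuracy: {accuracy:.2f}")            # high, yet clinically useless
```

The model looks accurate only because the feature it relies on already encodes the clinician's suspicion; it would tell a clinician nothing new, and a black box would hide that fact.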

Explainability: A Band-Aid for Black Box Models

In response to the black box nature of deep neural networks, various “explainability” methods have been developed in an attempt to capture their decision-making process (e.g., saliency maps, heatmaps, and other feature-attribution techniques). Most of these are post-hoc tests that attempt to explain model outputs after they have been generated, usually by estimating the importance of different inputs to the output. However, there are three key concerns with using explainability on individual predictions.

First, post-hoc explainability techniques are unreliable. These methods merely estimate or approximate the model’s decision-making process; they cannot be perfectly faithful, or they would simply be the original model. One systematic evaluation found that explainability techniques in medical imaging had low fidelity scores and were inconsistent under noisy inputs. Likewise, heatmaps and saliency maps have been shown to change dramatically when imperceptible manipulations are applied to inputs. Saliency maps often fail basic sanity checks, highlighting regions that bear little relation to the model’s actual reasoning.
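The approximation problem can be seen in a minimal local-surrogate sketch (in the style of LIME-like methods). The “black box” here is an invented nonlinear function standing in for a clinical model; the surrogate’s linear weights play the role of the “explanation,” and its local R² is a fidelity score that, by construction, falls short of perfect:

```python
import numpy as np

# Hypothetical "black box": a nonlinear risk score over two inputs.
# Purely synthetic -- a stand-in for a real clinical model.
def black_box(X):
    return 1.0 / (1.0 + np.exp(-(np.sin(3 * X[:, 0]) + X[:, 1] ** 2)))

rng = np.random.default_rng(0)
x0 = np.array([0.2, 0.5])                    # the instance to "explain"

# Local surrogate: sample perturbations near x0, then fit a linear
# model to the black box's outputs in that neighborhood.
Z = x0 + rng.normal(scale=0.1, size=(500, 2))
y = black_box(Z)
A = np.column_stack([Z, np.ones(len(Z))])    # design matrix + intercept
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Fidelity: how well the "explanation" reproduces the black box locally.
resid = y - A @ coef
fidelity_r2 = 1 - resid.var() / y.var()
print("surrogate weights:", coef[:2])
print("local R^2 (fidelity):", round(float(fidelity_r2), 3))
```

Because the underlying function is nonlinear, the linear explanation is necessarily an approximation: the fidelity score is below 1, and it would drop further as the sampled neighborhood widens or the function grows more complex.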

The second key issue with explainability is that it has no bearing on the correctness of the prediction. Even if an explanation is faithful, the model’s output could simply be wrong in the first place. And explanations are often unfaithful or uninterpretable themselves, compounding the problem.

This leads into the third key concern: the explainability trap. A form of automation bias, this is the phenomenon where attaching an explanation to a model’s output makes the user more likely to trust it, despite explanations having no bearing on accuracy. Furthermore, even if a given explanation is faithful, the clinician must still make a subjective judgment about what the explanation means and whether the prediction is trustworthy.

Explainability Is Ethically Insufficient

Despite sounding like a convenient solution, explainability techniques cannot salvage the ethical principles that black box AI models violate in a health care context.

Because current explainability methods are often unreliable at explaining individual predictions, they create a risk of deception. A deceptive explanation would violate patient autonomy (informed consent and truth-telling) and nonmaleficence.

The fact that explainability methods have no bearing on correctness becomes ethically problematic when the underlying model has poor accuracy, is taking shortcuts, and/or is biased. Explainability methods may fail to expose these model flaws and thus fail to uphold the principle of justice.

The explainability trap heightens this issue: given a flawed model, adding post-hoc explainability will merely make its outputs appear more convincing – likely perpetuating systemic bias and dangerous predictions in high-risk scenarios. The explainability trap directly violates nonmaleficence, since disguising a wrong prediction would be actively doing harm to patients.

A classic case illustrating these concerns is IBM’s Watson for Oncology. The system was meant to recommend treatments for cancer patients, but it contained a number of foundational flaws that often led to unhelpful, false, or even dangerous recommendations. In one reported example, given a patient whose cancer had not spread to the lymph nodes, Watson recommended a drug suited to patients with more advanced disease, and to support its recommendation, it cited a study demonstrating the efficacy of this drug. In this specific example the failure was obvious. But the ethical danger is clear: post-hoc explanations make false outputs appear more trustworthy. This would be even more dangerous with mathematical explainability techniques that are less interpretable to clinicians. Watson failed at both interpretability and validation.

Interpretability: A Promising Alternative

A different response to the black box problem of deep learning is to design inherently interpretable models that are not black boxes in the first place. While post-hoc explainability tries to put a bandage over a black box model, “ante-hoc” interpretability builds a model whose structure, and thus its decision-making process, can be intuitively understood. Basic examples of inherently interpretable models include a decision tree (essentially a flowchart) and a rule-based system.
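As a concrete sketch, a rule-based system can be as simple as a few named thresholds, with every prediction traceable to the exact rules that produced it. The thresholds below are loosely inspired by SIRS-style screening criteria but are illustrative assumptions only, not clinical guidance:

```python
# Toy rule-based triage model: thresholds are illustrative assumptions,
# not clinical guidance. Every output carries its full decision path.
def triage(temp_c: float, heart_rate: int, wbc_k: float):
    """Return (risk level, the human-readable rules that fired)."""
    fired = []
    if temp_c > 38.3 or temp_c < 36.0:
        fired.append(f"abnormal temperature ({temp_c} C)")
    if heart_rate > 90:
        fired.append(f"elevated heart rate ({heart_rate} bpm)")
    if wbc_k > 12.0 or wbc_k < 4.0:
        fired.append(f"abnormal white cell count ({wbc_k} K/uL)")
    level = ("low", "moderate", "high")[min(len(fired), 2)]
    return level, fired

level, reasons = triage(temp_c=39.0, heart_rate=105, wbc_k=13.5)
print(level)         # "high"
for r in reasons:    # the complete reasoning, in plain language
    print("-", r)
```

Unlike a post-hoc heatmap, this decision path is the model: a clinician can relay it to a patient verbatim, and any bias or shortcut in the rules is visible by inspection.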

It is commonly assumed that interpretable models must perform worse than black box deep learning models. But recent research has shown this “accuracy-interpretability tradeoff” to be false or greatly overstated in many settings.

More importantly, interpretable models offer a number of ethical benefits. Since it is possible to see exactly how a model arrived at a given prediction, patients can receive model recommendations with fully informed consent. The risk of harm that comes with the explainability trap is mitigated. And interpretability will readily expose bias or shortcut learning that threaten the principle of justice. If an accuracy penalty does exist, this represents a small loss of beneficence, but in exchange for upholding the three other principles of nonmaleficence, autonomy, and justice.

Moving Forward: Combining Validation, Interpretability, and Explainability

When it comes to clinical AI models, the obsession with explainability must be left behind. Instead, regulations need to first prioritize rigorous validation, then inherent interpretability, and finally post-hoc explainability only for global model audits.

Rigorous validation of models, similar to phased clinical trials for drugs, has been proposed. Some have even argued that rigorous validation, not explainability, should be the primary requirement for clinical AI. Regardless, external validation on diverse populations is of utmost importance to confirm the reliability and fairness of clinical AI models before they are deployed.

Inherently interpretable models should be prioritized over black box models, particularly when there is little-to-no difference in accuracy. When there is a gap in accuracy, both black box and interpretable models should be made available to hospitals and clinicians.

Finally, explainability can be used by researchers for global model audits, but never to explain individual predictions to patients.

Current trends and regulations on clinical AI are not promising. This represents an ethical crisis as clinical AI continues to be integrated into the US health care system. Rather than blindly explaining, we must shift toward rigorous validation and interpretable architectures that uphold beneficence, nonmaleficence, autonomy, and justice for all patients.

Mar 30, 2026