Katechon, Improving Oncological AI Models
Fusing metadata and images // Katechon // Open-source contribution
The push to improve machine-learning diagnoses has led researchers to look beyond images alone, integrating metadata such as patient age, sex, genetic markers, and more to build holistic patient profiles.
Zhang et al. explored this idea in their 2022 paper “Multimodal learning for reliable visual-assisted diagnosis”, where they proposed a model architecture that fuses image data and metadata through mutual attention. Rather than treating metadata as a secondary input, the model lets both data streams actively inform each other during training.
Unfortunately, the paper did not include source code, so we reimplemented the architecture from scratch. It is now publicly available as an open-source model on our GitHub.
How it works
The model has three components: an image encoder, a metadata encoder, and a mutual-attention decoder.
The image encoder is a Vision Transformer (ViT) pretrained on ImageNet. We freeze the early layers and fine-tune the final five, so the model retains its general visual understanding while learning domain-specific features — in our case, skin lesion classification.
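The freezing scheme can be sketched with a small helper that operates on any stack of transformer blocks, such as `model.encoder.layers` in a torchvision ViT or `model.blocks` in a timm ViT. The 12-block stand-in below is illustrative, not our actual backbone:

```python
import torch.nn as nn

def freeze_early_layers(blocks, n_trainable: int = 5):
    """Freeze every transformer block except the final n_trainable.

    `blocks` is any sequence of nn.Module, e.g. the encoder layers of a
    pretrained Vision Transformer.
    """
    blocks = list(blocks)
    for block in blocks[:-n_trainable]:   # early layers: keep general features
        for p in block.parameters():
            p.requires_grad = False
    for block in blocks[-n_trainable:]:   # final layers: fine-tune on lesions
        for p in block.parameters():
            p.requires_grad = True

# Stand-in for a pretrained ViT's 12 encoder blocks (hypothetical sizes):
demo_blocks = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
    for _ in range(12)
)
freeze_early_layers(demo_blocks, n_trainable=5)
trainable = sum(
    all(p.requires_grad for p in b.parameters()) for b in demo_blocks
)
print(trainable)  # 5
```

In practice the classification head should stay trainable as well, since it is replaced for the target label set.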
The metadata encoder passes structured patient data through a soft label encoder and a small fully connected network. Soft label encoding replaces the zeros in one-hot vectors with a small constant (0.01), so every input unit receives a nonzero signal during training rather than just one.
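A minimal sketch of this encoder follows; the field choices, dimensions, and the `MetadataEncoder` name are illustrative assumptions, not the exact configuration from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def soft_one_hot(indices: torch.Tensor, num_classes: int,
                 eps: float = 0.01) -> torch.Tensor:
    """One-hot encode, then replace the zeros with a small constant."""
    hot = F.one_hot(indices, num_classes).float()
    return hot * (1.0 - eps) + eps  # zeros -> eps, ones stay at 1.0

class MetadataEncoder(nn.Module):
    """Small fully connected network over the soft-encoded metadata."""
    def __init__(self, in_dim: int, feat_dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Hypothetical fields: sex (2 classes), anatomical site (8 classes), age.
sex = soft_one_hot(torch.tensor([0, 1]), 2)
site = soft_one_hot(torch.tensor([3, 5]), 8)
age = torch.tensor([[45.0], [60.0]]) / 100.0   # crude scalar normalisation
meta = torch.cat([sex, site, age], dim=1)      # shape (2, 11)
features = MetadataEncoder(in_dim=11)(meta)    # shape (2, 768)
```

The output dimension is chosen to match the image encoder's feature size so the two streams can attend to each other directly.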
Both encoders produce learned feature vectors: one representing the image and the other representing the metadata. These are then fed into the mutual attention block.
Mutual attention
The mutual attention block is where the two streams interact. In a standard attention mechanism, queries, keys, and values all come from the same input. In mutual attention, the query vectors are swapped: the image features generate queries that are applied to the metadata’s keys, and the metadata features generate queries that are applied to the image’s keys.
To illustrate, imagine two expert dermatologists who have separately studied the images and the metadata for a given lesion. The one who studied the metadata asks the other a set of questions (queries); the one who studied the images matches each question against what they learned (keys) and answers from that knowledge (values). Then they switch roles, so each modality interrogates the other.
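The query swap can be sketched with two standard cross-attention passes; the head count, dimensions, and token shapes below are assumptions for illustration, not the paper's exact values:

```python
import torch
import torch.nn as nn

class MutualAttention(nn.Module):
    """Two cross-attention passes with swapped queries (a sketch)."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.img_asks_meta = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.meta_asks_img = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img: torch.Tensor, meta: torch.Tensor):
        # Image features form the queries; metadata supplies keys and values.
        img_out, _ = self.img_asks_meta(query=img, key=meta, value=meta)
        # Roles swapped: metadata queries interrogate the image features.
        meta_out, _ = self.meta_asks_img(query=meta, key=img, value=img)
        return img_out, meta_out

img = torch.randn(4, 197, 768)   # e.g. ViT patch tokens: (batch, tokens, dim)
meta = torch.randn(4, 1, 768)    # metadata features as a single token
img_out, meta_out = MutualAttention()(img, meta)
```

Each output keeps the sequence length of its own queries, so the image stream stays token-shaped while the metadata stream remains a single vector.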
The original feature vectors are then concatenated with the respective outputs of the attention blocks. This is a skip connection: it preserves the original input context alongside the attention-transformed features. The two combined streams are then concatenated together and passed through a fully connected network to produce the final prediction.
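Assuming each stream has been pooled to a single feature vector, the skip concatenation and classification head might look like this (layer sizes and the two-class output are illustrative assumptions):

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Concatenate each attention output with its original features (the
    skip connection), join both streams, and classify."""
    def __init__(self, dim: int = 768, num_classes: int = 2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4 * dim, 512), nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, img, img_att, meta, meta_att):
        img_branch = torch.cat([img, img_att], dim=-1)    # skip: original + attended
        meta_branch = torch.cat([meta, meta_att], dim=-1)
        fused = torch.cat([img_branch, meta_branch], dim=-1)  # (batch, 4*dim)
        return self.mlp(fused)

# Pooled feature vectors, one per modality (hypothetical batch of 4):
img, img_att = torch.randn(4, 768), torch.randn(4, 768)
meta, meta_att = torch.randn(4, 768), torch.randn(4, 768)
logits = FusionHead()(img, img_att, meta, meta_att)  # shape (4, 2)
```

Concatenating rather than adding the skip lets the head weight the original and attended features independently.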
Why we built it
We first encountered this architecture during the ISIC 2024 Kaggle competition. The dataset contained 400,000 low-resolution skin lesion images; image-only models quickly hit a ceiling, and metadata was expected to make the difference. Mutual attention seemed the right approach, but no public implementation existed.
The full implementation is on GitHub. It is intended as a working starting point: replace the backbone, adjust the attention heads, and change the network depth. The code is there for you to build on.
Conclusion
Mutual attention offers a clean approach to multimodal learning, especially when different data types don’t merely coexist but actively improve one another. In domains such as medical imaging, where a diagnosis rarely depends on a single source of information, such integration can be critical to achieving human-level performance.
If you find it useful, have questions, or want to contribute, you can find us on GitHub.

