Synthetic Patients

Virtual patient chatbots for medical training and simulation

Implementation Details

From our preprint's technical appendix:

In this study, we present a novel approach to simulating difficult conversations using realistic AI avatars. This intervention involved three primary steps: construction of patient narrative histories, generation of patient multimedia, and integration with a custom video chat application. This document describes the methodology of each of these steps in greater technical detail.

Construction of patient profiles
As described in the primary manuscript, patient profiles were generated based on fictional clinical scenarios and patients. These were combined with a common set of instructions to form the overall prompt, which was passed to a language model to generate the synthetic patient chatbots. From a technical perspective, we utilized the GPT-4 Assistants API (OpenAI, San Francisco, CA). Temperature was set to 0.8 to generate a variety of responses. For initial testing, we used the Custom GPTs feature, which has similar functionality to the Assistants API but allowed for more rapid prototyping.
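As a minimal sketch of this step (assuming the openai Python SDK, v1.x), the snippet below creates a single synthetic patient chatbot with the Assistants API. The model name, common instructions, and profile text are illustrative placeholders, not the exact prompts used in the study.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative prompt components; the study's narrative histories and shared
# instruction set are described in the primary manuscript.
COMMON_INSTRUCTIONS = (
    "You are role-playing a patient in a telehealth visit. "
    "Answer only as the patient, in the first person, and stay in character."
)
PATIENT_PROFILE = "Name: [fictional]. Age: 58. Scenario: newly diagnosed lung mass."

# Create the assistant with the combined prompt and temperature of 0.8.
assistant = client.beta.assistants.create(
    name="synthetic-patient-demo",
    model="gpt-4-turbo",  # any GPT-4-class model
    instructions=COMMON_INSTRUCTIONS + "\n\n" + PATIENT_PROFILE,
    temperature=0.8,
)

# Each conversation runs in its own thread.
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Can you tell me what brings you in today?",
)
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id, assistant_id=assistant.id
)

# Messages are returned newest-first, so the first entry is the patient's reply.
reply = client.beta.threads.messages.list(thread_id=thread.id).data[0]
print(reply.content[0].text.value)
```

In a deployed setting, each learner encounter would map to its own thread so that conversational context is kept separate across sessions.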

Image generation
To generate images and videos of each synthetic patient, multiple imaging models were employed. After developing the patient profiles, we generated images for the patients using Midjourney (version 5, San Francisco, CA) and Stable Diffusion XL (Stability AI, San Francisco, CA). To refine images while maintaining character continuity, we used generative infilling and extracted physical characteristics (via low-rank adaptations, or LoRAs) within Midjourney. Additional fine details were added with Photoshop (Adobe, San Jose, CA).
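Midjourney runs through its own interface, but the Stable Diffusion XL portion of this workflow can be approximated with the open-source diffusers library. The sketch below is a generic text-to-image call; the prompt is illustrative rather than one used in the study, and a CUDA-capable GPU is assumed.

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the base SDXL checkpoint in half precision.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Illustrative prompt; in practice this would be derived from the patient profile.
prompt = (
    "waist-up portrait of a 58-year-old man in a clinic exam room, "
    "neutral expression, soft natural lighting, photorealistic"
)

image = pipe(prompt=prompt, num_inference_steps=30, guidance_scale=7.0).images[0]
image.save("patient_portrait.png")
```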

Audio generation
To develop the voices for the synthetic patients, we identified royalty-free audio clips of individuals speaking. We collated these files, cleaned them in audio-processing software (Ableton Live; Ableton, Berlin, Germany), and cloned them using the ElevenLabs voice-cloning tool (ElevenLabs, New York, NY). Manual iterative adjustments were made to the stability, similarity, and style parameters until the voices were stable and realistic.
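The parameter tuning described above can be scripted against the ElevenLabs text-to-speech REST endpoint, as sketched below. The voice ID is a placeholder for a cloned voice, and the stability, similarity, and style values shown are arbitrary starting points rather than the settings used in the study.

```python
import requests

ELEVENLABS_API_KEY = "..."          # personal API key (placeholder)
VOICE_ID = "your-cloned-voice-id"   # ID of the voice cloned from the cleaned clips

resp = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": ELEVENLABS_API_KEY},
    json={
        "text": "I've been feeling more short of breath over the past few weeks.",
        "voice_settings": {
            # Adjusted iteratively until the voice sounds stable and realistic;
            # the values here are illustrative starting points.
            "stability": 0.5,
            "similarity_boost": 0.75,
            "style": 0.2,
        },
    },
)
resp.raise_for_status()

# The response body is the synthesized audio.
with open("patient_reply.mp3", "wb") as f:
    f.write(resp.content)
```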


Video generation
To generate video of the patient responding to queries, we started with the images created above. These were then used as inputs to multiple video-generation tools, including Pika Labs (Pika Labs, Palo Alto, CA), Runway Gen-2 (Runway AI, New York, NY), and Stable Video Diffusion (Stability AI, San Francisco, CA), to generate body and head sway movements. These models produced realistic movements of the patients’ heads and bodies but significantly distorted the patients’ faces.

To rectify this, we pivoted to lip-syncing-focused software for the facial animations. We tested options from D-ID (Tel Aviv, Israel), Synthesia (London, UK), and HeyGen (Los Angeles, CA), finding that HeyGen produced the most realistic animations, though generation times were substantial (10-20 minutes). This prolonged generation time would limit the ability of these tools to serve in “real-time” applications. HeyGen does offer a real-time avatar option, but its processing requirements are substantial, and the associated costs were beyond the budget of this project. Notably, our demonstration video (https://doi.org/10.6084/m9.figshare.25930861) utilizes HeyGen lip-synchronization technology and has been edited to demonstrate what a real-time avatar may offer.

Alternative open-source lip-syncing models were explored to increase speed. We evaluated Wav2Lip and improved variants, such as Wav2Lip with a generative adversarial network (Wav2Lip+GAN) and sync1.6 (SyncLabs, San Francisco, CA), finding that the Wav2Lip+GAN version had adequate quality with reasonable generation times (20-30 seconds on a consumer-grade desktop).
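For reference, the Wav2Lip+GAN checkpoint can be driven from Python by invoking the inference script from the public Wav2Lip repository (https://github.com/Rudrabha/Wav2Lip), as sketched below. All file paths are placeholders; the face input is the base video described in the next paragraph, and the audio is the synthesized patient voice.

```python
import subprocess

# Call Wav2Lip's inference script with the GAN-trained checkpoint.
subprocess.run(
    [
        "python", "inference.py",
        "--checkpoint_path", "checkpoints/wav2lip_gan.pth",
        "--face", "base_video.mp4",        # base video of the patient
        "--audio", "patient_reply.mp3",    # synthesized voice for this turn
        "--outfile", "results/lipsynced_reply.mp4",
    ],
    check=True,
)
```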

A drawback of most of the lip-syncing models was a relative lack of body animation, making the avatars appear to be lifeless “talking heads.” To resolve this, we combined the body-sway videos generated with Runway Gen-2 with the lip-syncing footage from HeyGen via a feathered overlay in Adobe Premiere (Adobe, San Jose, CA). This produced a video of the patient subtly shifting their body and speaking some example text with high-quality lip-syncing. This video then served as our base video on which to perform dynamic lip-syncing via the web application.

Web application
To facilitate real-time interaction with the synthetic patient, we developed a simple web application using Python (Python Software Foundation, Beaverton, OR). Within this application, users access a mock telehealth interface displaying a video feed of the patient awaiting the provider’s question. Users initiate communication by pressing a speech button, allowing their question to be recorded. This audio is subsequently transcribed into text using a voice-to-text model, the Whisper API (OpenAI, San Francisco, CA). The text is then sent to the OpenAI inference API, which generates a text response. The response is then converted into audio using the patient’s predetermined voice. To lip-sync the audio to the patient video, the aforementioned base video is used as the template. This base video and the generated audio are passed to the lip-sync engine, which generates an audiovisual clip that appears as if the patient is directly responding to the provider's inquiry. This clip is then integrated into the live video feed, allowing for a seamless interaction.
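A single conversational turn in this pipeline could be wired together roughly as in the sketch below, assuming the openai and requests Python packages and a local Wav2Lip checkout. The thread and assistant identifiers, API key, voice ID, and file paths are all placeholders, and the deployed application's implementation details may differ.

```python
import subprocess
import requests
from openai import OpenAI

client = OpenAI()                      # OPENAI_API_KEY from the environment
ELEVENLABS_API_KEY = "..."             # placeholder
VOICE_ID = "your-cloned-voice-id"      # placeholder

def handle_provider_turn(audio_path: str, thread_id: str, assistant_id: str) -> str:
    """One turn: transcribe the question, generate a reply, voice it, lip-sync it."""
    # 1. Speech-to-text with Whisper.
    with open(audio_path, "rb") as f:
        question = client.audio.transcriptions.create(model="whisper-1", file=f).text

    # 2. Text response from the synthetic-patient assistant.
    client.beta.threads.messages.create(thread_id=thread_id, role="user", content=question)
    client.beta.threads.runs.create_and_poll(thread_id=thread_id, assistant_id=assistant_id)
    reply = client.beta.threads.messages.list(thread_id=thread_id).data[0].content[0].text.value

    # 3. Text-to-speech in the patient's cloned voice.
    tts = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
        headers={"xi-api-key": ELEVENLABS_API_KEY},
        json={"text": reply},
    )
    tts.raise_for_status()
    with open("reply.mp3", "wb") as f:
        f.write(tts.content)

    # 4. Lip-sync the audio onto the base video (Wav2Lip+GAN, as above).
    subprocess.run(
        ["python", "inference.py",
         "--checkpoint_path", "checkpoints/wav2lip_gan.pth",
         "--face", "base_video.mp4",
         "--audio", "reply.mp3",
         "--outfile", "results/reply.mp4"],
        check=True,
    )
    return "results/reply.mp4"
```

The returned clip is what the interface splices into the live video feed for that turn.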

To permit individuals to experiment with our platform, we developed a containerized version of the application using Docker. Instructions for installation have been deposited in a Hugging Face repository and are available at https://doi.org/10.57967/hf/2338. Though the application size is approximately 5 GB, containerization allows the software to run on almost any desktop. Users will need accounts and API keys from OpenAI and ElevenLabs. Because of limitations of audio input/output within the containerized application, the audio-to-text functionality is not available; instead, users will need to provide their questions in the form of text.


Appendix 2: Example of Encounter with Synthetic Patient

Video file available at https://doi.org/10.6084/m9.figshare.25930861