Teaching Artificial Intelligence to Ask Clinical Questions
Researchers have made progress toward machine-learning algorithms that can help physicians quickly locate information in patient health records.
The cumbersome nature of electronic health records makes it difficult for physicians to search for information that can help them make treatment decisions. Even when doctors are properly trained to use an electronic health record (EHR), finding the answer to a single question can take an average of eight minutes. As a result, physicians spend less time seeing and treating patients and more time navigating an often clunky EHR interface.
A number of researchers have developed machine-learning models that can automate the process of finding the information physicians need in an electronic health record. Training such models, however, requires large annotated datasets of relevant medical questions, which are often difficult to acquire because of privacy concerns. In addition, many existing models struggle to generate authentic questions that a doctor would ask and are often unable to provide correct answers.
To address this data shortage, researchers at MIT partnered with medical experts to study the questions physicians ask when reviewing electronic health records. They then collected more than 2,000 clinically relevant questions written by these medical experts and made them publicly available for use and annotation.
A machine-learning model trained on the dataset generated authentic, high-quality clinical questions more than 60% of the time, as judged against real questions from medical experts.
Using this dataset, the researchers plan to generate vast numbers of authentic medical questions and then use them to train a machine-learning model that helps doctors locate key information in patient records more efficiently.
The number of questions may sound like a lot, but compared with the volumes of data used to train machine-learning models today, it is not much at all. There is a real lack of data for training machine-learning models to work in healthcare settings, said lead author Eric Lehman, a graduate student in the Computer Science and Artificial Intelligence Laboratory (CSAIL).
Training models requires realistic data relevant to the task, which is difficult to find or create.
Lack of Data
Lehman explains that the few large datasets of clinical questions the researchers were able to locate had a number of problems. Some consist of questions patients posted on online forums, which differ greatly from the questions physicians ask. Others are built from templates, so most of their questions are nearly identical in structure, making them a poor reflection of the questions physicians actually pose.
The MIT researchers built their dataset by working with practicing physicians and medical students in their last year of training. They gave these medical experts more than 100 EHR discharge summaries and asked them to read the summaries and pose any questions they had. To gather natural questions, the researchers placed no restrictions on the types or structures of questions. They also asked the experts to identify the "trigger text" in the EHR that led them to ask each question. For example, a medical expert might read a note in an electronic health record indicating a patient's history of prostate cancer and hypothyroidism. Prompted by the trigger text "prostate cancer," the expert might ask questions such as "date of diagnosis?" or "has there been any treatment?"
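To make the shape of such an annotation concrete, the sketch below shows one plausible way to represent a trigger-question record in code. The field names and example values are illustrative assumptions, not the schema of the released dataset.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TriggerQuestionRecord:
    """One annotation: a span of EHR text and the questions it prompted.

    Field names are illustrative only; they do not mirror the schema of
    the dataset released by the researchers.
    """
    note_id: str            # identifier of the discharge summary
    trigger_text: str       # span of text that prompted the questions
    trigger_start: int      # character offset of the span in the note
    trigger_end: int
    questions: List[str] = field(default_factory=list)

# Hypothetical example based on the prostate-cancer note described above.
record = TriggerQuestionRecord(
    note_id="discharge_0042",
    trigger_text="prostate cancer",
    trigger_start=118,
    trigger_end=133,
    questions=["date of diagnosis?", "has there been any treatment?"],
)

print(record.trigger_text, "->", record.questions)
```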
The researchers found that the majority of questions concerned symptoms, treatments, or test results. While these findings were not surprising, quantifying how many questions relate to each major topic helps in building a clinically applicable dataset.
After collecting their dataset of questions and associated trigger text, the researchers used it to train machine-learning models to formulate new questions based on a given trigger text.
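The article does not say which architecture the researchers used. As one plausible setup, a pretrained sequence-to-sequence model could be fine-tuned to map a trigger span and its surrounding note text to a question. The sketch below shows what inference with such a model might look like using the Hugging Face transformers library; the checkpoint name is a placeholder for a hypothetical fine-tuned model.

```python
# Sketch: generating a clinical question from trigger text with a
# sequence-to-sequence model. The checkpoint path is a placeholder for a
# model assumed to be fine-tuned on trigger/question pairs; the article
# does not state which architecture the researchers actually used.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "path/to/finetuned-question-generator"  # hypothetical checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def generate_questions(trigger: str, context: str, num_questions: int = 3):
    """Return a few candidate questions conditioned on a trigger and its context."""
    prompt = f"trigger: {trigger} context: {context}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    outputs = model.generate(
        **inputs,
        max_new_tokens=32,
        num_beams=4,
        num_return_sequences=num_questions,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

print(generate_questions(
    trigger="prostate cancer",
    context="History of prostate cancer and hypothyroidism.",
))
```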
Medical experts then judged whether those questions were "good" against four metrics: comprehension, triviality, medical relevance, and relevance to the trigger (is the trigger related to the question?).
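As an illustration of how such ratings might be aggregated (the scheme below is an assumption, not the researchers' actual evaluation protocol), one could count a generated question as "good" only if it passes all four metrics and then report the passing fraction:

```python
# Illustrative aggregation of expert ratings; the rating scheme is an
# assumption, not the researchers' actual evaluation protocol.
ratings = [
    # comprehensible, non-trivial, medically relevant, relevant to trigger
    {"comprehensible": True, "nontrivial": True, "medically_relevant": True, "trigger_relevant": True},
    {"comprehensible": True, "nontrivial": False, "medically_relevant": True, "trigger_relevant": True},
    {"comprehensible": False, "nontrivial": True, "medically_relevant": True, "trigger_relevant": False},
]

def is_good(rating):
    # A question counts as "good" only if it passes every metric.
    return all(rating.values())

good_fraction = sum(is_good(r) for r in ratings) / len(ratings)
print(f"good questions: {good_fraction:.0%}")
```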
Worrisome Factors
The researchers found that when the model was given trigger text, it generated a good question 63% of the time, whereas a physician asked a good question 80% of the time.
Using the publicly available datasets identified at the start of this study, they also trained models to retrieve answers to clinical questions. They then tested whether these trained models could answer the "good" questions posed by the medical experts. The models were unable to recover the answers to roughly 25% of the physician-generated questions.
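Answer retrieval of this kind is commonly framed as extractive question answering over the note text. The snippet below illustrates that framing with a generic pretrained QA pipeline and a made-up note; it is not the specific models, datasets, or evaluation setup used in the study.

```python
# Illustration of extractive question answering over a clinical note,
# using a generic pretrained QA pipeline; this is not the specific model
# or evaluation setup used in the study.
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

note = (
    "The patient has a history of prostate cancer, diagnosed in 2015 and "
    "treated with radiation therapy, and hypothyroidism managed with levothyroxine."
)

result = qa(question="Has there been any treatment for prostate cancer?", context=note)
print(result["answer"], f"(score={result['score']:.2f})")
```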
This outcome is troubling. Lehman explains that models people believed to be high-performing were, in reality, performing poorly because the evaluation questions they were tested on were subpar. The team is now applying this work toward its original goal: building a model that can automatically answer physicians' EHR questions. The next step is to use the dataset to train a machine-learning model that can automatically generate thousands or even millions of high-quality clinical questions, which will in turn be used to train a model for automatically answering them.
Source: MIT News