Teaching artificial intelligence to ask clinical questions
Researchers have made progress toward machine-learning algorithms that can aid physicians in rapidly locating patient health record information.
The cumbersome nature of electronic health records makes it difficult for physicians to search for information that can help them make treatment decisions. Even when doctors are properly trained to use an electronic health record (EHR), it can take them an average of eight minutes to find an answer to one question.
The more time physicians spend navigating an often clunky EHR interface, the less time they have to spend with patients and provide treatment.
A number of researchers have developed machine-learning models that can automate the process of finding the information physicians need in an EHR. Training a well-performing model, however, requires large datasets of relevant, annotated medical questions, which are often difficult to acquire because of privacy concerns. Existing models also struggle to generate authentic questions, that is, questions a human doctor would actually ask, and they often fail to find correct answers.
To address this data shortage, researchers at MIT partnered with medical experts to study the questions physicians ask when reviewing electronic health records. They then collected more than 2,000 clinically relevant questions written by these medical experts and released them as a publicly available, annotated dataset.
A machine-learning model trained on the dataset generated high-quality, authentic clinical questions more than 60 percent of the time, as judged against real questions from medical experts.
Using this dataset, they plan to generate vast numbers of authentic medical questions and then use them to train a machine-learning model that helps doctors find sought-after information in patient records more efficiently.
Two thousand questions may sound like a lot, but it is small compared with the billions of data points used to train today's machine-learning models, and that kind of data shortage is typical when training machine-learning models for health care settings, said lead author Eric Lehman, a graduate student in the Computer Science and Artificial Intelligence Laboratory (CSAIL).
Training such models requires realistic data that is relevant to the task, which is difficult to find or create.
Lack of data
Lehman explains that the few large datasets of clinical questions the researchers were able to find had a number of issues. Some of the questions were posed by patients on online forums, and those differ greatly from questions posed by physicians. In other datasets, many of the questions are generated from templates, so they are mostly identical in structure, which makes them unrealistic.
The MIT researchers built their dataset by working with practicing physicians and medical students in their last year of training. These medical experts were given more than 100 EHR discharge summaries and instructed to read through them and ask any questions they might have. To gather natural questions, the researchers did not restrict the types or structures of questions. The experts were also asked to identify the "trigger text" in the EHR that led them to ask each question.
For example, a medical expert may read a note in an electronic health record indicating that a patient has a history of prostate cancer and hypothyroidism. In response to the trigger text "prostate cancer," the expert may ask questions such as "date of diagnosis?" or "has there been any treatment?"
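To make the structure of such an annotation concrete, the sketch below shows one way a single dataset entry could be represented: a trigger span located in a discharge summary, paired with the questions it prompted. The field names and layout here are illustrative assumptions, not the researchers' actual schema.

```python
# Hypothetical representation of one annotated example: a trigger span in a
# discharge summary plus the free-text questions an expert asked about it.
from dataclasses import dataclass, field

@dataclass
class TriggerAnnotation:
    note_id: str                 # which discharge summary the trigger came from
    trigger_text: str            # the span that prompted the questions
    trigger_start: int           # character offset of the span in the note
    trigger_end: int
    questions: list = field(default_factory=list)  # expert-written questions

example = TriggerAnnotation(
    note_id="discharge_summary_0042",   # placeholder identifier
    trigger_text="prostate cancer",
    trigger_start=118,
    trigger_end=133,
    questions=["date of diagnosis?", "has there been any treatment?"],
)

print(example.trigger_text, example.questions)
```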
They found that the majority of questions concerned symptoms, treatments, or test results. While these findings were not surprising, quantifying the number of questions about each major topic helps in building a clinically relevant dataset.
After collecting their dataset of questions and associated trigger text, they used it to train machine-learning models to formulate new questions based on the trigger text.
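The article does not describe the model architecture used for question generation. As a minimal sketch, assuming a generic pretrained sequence-to-sequence model (BART, via the Hugging Face transformers library) fine-tuned on (trigger + context → question) pairs, the generation step might look like this:

```python
# Sketch of question generation conditioned on trigger text.
# The input format and checkpoint below are assumptions for illustration only.
from transformers import BartTokenizer, BartForConditionalGeneration

model_name = "facebook/bart-base"  # placeholder; not the researchers' checkpoint
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

context = "History of prostate cancer and hypothyroidism."
trigger = "prostate cancer"

# One possible input format: name the trigger, then append the note context.
inputs = tokenizer(f"trigger: {trigger} context: {context}", return_tensors="pt")

# Without fine-tuning on a clinical question dataset, the output will not be a
# meaningful clinical question; this only demonstrates the generation interface.
output_ids = model.generate(**inputs, max_length=32, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```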
Then, medical experts determined whether those questions were "good" using four metrics: understandability, triviality, medical relevance, and relevance to the trigger (is the trigger related to the question?).
Worrisome results
The researchers found that when a model was given trigger text, it generated a good question 63 percent of the time, whereas a human physician asked a good question 80 percent of the time.
Using the publicly available datasets they had found at the start of the study, they also trained models to retrieve answers to clinical questions. They then tested whether these models could answer the "good" questions posed by the human medical experts.
The models were unable to recover answers to about 25 percent of the physician-generated questions.
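The article does not name the answer-retrieval models or the evaluation protocol. As a rough sketch, an extractive question-answering model could be queried against a discharge note with a physician-style question using a generic SQuAD-style pipeline, which is an assumed stand-in here:

```python
# Sketch: querying an extractive QA model with a physician-style question.
# The model name, note text, and scoring below are illustrative assumptions.
from transformers import pipeline

qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

note = (
    "The patient has a history of prostate cancer, diagnosed in 2015 and "
    "treated with radiation therapy, as well as hypothyroidism."
)

result = qa(question="Has there been any treatment for the prostate cancer?",
            context=note)
print(result["answer"], result["score"])

# A simple evaluation would check whether the predicted span overlaps an
# annotated answer; per the article, roughly a quarter of physician-generated
# questions could not be answered this way.
```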
This result is troubling, Lehman explains: models that people believed were high-performing were in reality performing poorly, because the evaluation questions they had been tested on were of low quality. The team is now applying this work toward its original goal: building a model that can automatically answer physicians' questions about an EHR. As a next step, they will use their dataset to train a machine-learning model that can automatically generate huge numbers of high-quality clinical questions, which will then be used to train a new model for automatic question answering.
Source: MIT News