
Automated Coding of Medical Diagnostics from Free-Text: the Role of Parameters Optimization and Imbalanced Classes.


Formal Metadata

Title
Automated Coding of Medical Diagnostics from Free-Text: the Role of Parameters Optimization and Imbalanced Classes.
Number of Parts
12
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
The extraction of codes from Electronic Health Records (EHR) data is an important task because extracted codes can be used for different purposes such as billing and reimbursement, quality control, epidemiological studies, and cohort identification for clinical trials. The codes are based on standardized vocabularies. Diagnostics, for example, are frequently coded using the International Classification of Diseases (ICD), which is a taxonomy of diagnosis codes organized in a hierarchical structure. Extracting codes from free-text medical notes in EHR such as the discharge summary requires the review of patient data searching for information that can be coded in a standardized manner. The manual human coding assignment is a complex and time-consuming process. The use of machine learning and natural language processing approaches has been receiving increasing attention to automate the process of ICD coding. In this article, we investigate the use of Support Vector Machines (SVM) and the binary relevance method for multi-label classification in the task of automatic ICD coding from free-text discharge summaries. In particular, we explored the role of SVM parameters optimization and class weighting for addressing imbalanced classes. Experiments conducted with the Medical Information Mart for Intensive Care III (MIMIC III) database reached 49.86% of f1-macro for the 100 most frequent diagnostics. Our findings indicated that optimization of SVM parameters and the use of class weighting can improve the effectiveness of the classifier.
Transcript: English (auto-generated)
I'm going to present a paper which we conducted with my PhD student, Luis: automated coding of medical diagnostics from free text, and the role of parameters optimization and imbalanced classes. First, we are at the University of Campinas in Brazil.
So where are we? It's not a geography class, but in Brazil we have the state of São Paulo, and in it the city of Campinas, which is quite a big city. In this city we have
the University of Campinas, which attracts lots of companies; because of the university, many startups have popped up there, so it is a source of innovation. Our university has about 36,000 students.
That is just to give you a big picture of the University of Campinas; you are all invited to visit us there. In our paper, we are concerned with electronic health records, which are used to register health information about patients:
physicians insert data regarding the patient's diagnostics. In electronic health records there are two types of data: structured data, such as laboratory results, and unstructured data, which is free-text notes,
our interest in this study. Most of this unstructured data consists of textual documents. Their advantage is giving great autonomy to physicians to register clinical information regarding the patient.
But if you want to extract some information from free text, this raises issues for data analytics. What we are actually trying to do here is to extract ICD codes automatically from this unstructured data.
So we have free text inserted by physicians, and we want to identify and extract the ICD codes from this natural language text. These ICD codes are very important for different purposes: for instance, billing and reimbursement
regarding the patient, quality control, and epidemiological studies. They also give a kind of semantic context, a semantic interpretation for the document; for instance, we can create better information retrieval systems to retrieve and rank
these medical documents. Here in particular, we are focused on the extraction of codes from discharge summaries, which are among the main documents in the electronic health record that physicians use to give information
about the patient: for instance, the diagnostics, the treatment the patient received in the hospital, and the main exams the patient had in the hospital. And this is a free-text document.
This process of assigning ICD codes to discharge summaries is usually performed manually by trained professional coders, and this manual assignment
is a very complex and time-consuming process. You can imagine a hospital with many patients: the records need to be analyzed by hand by these professional coders, who have to know the ICD structure and its codes in order to assign them manually
to the documents. So what we want is to investigate automated systems that can help with coding these ICD codes in electronic health records.
This task can be performed using machine learning approaches, but this kind of method faces some challenges. First, since we are dealing with free text, we can have misspellings or abbreviations,
which are not standardized. We have a large number of ICD codes that we need to assign to the documents. There is also class imbalance, because some codes appear in many documents
while many other codes appear in only a few, which complicates training the models for the classifiers.
And, since we are dealing with natural language, we need to create a feature set, which is usually very large because we need to handle the text in the document. Just to give an overview of this class imbalance problem:
if you consider MIMIC-III, the repository which is one of the datasets we are considering here, and its discharge summaries, the top three ICD codes used
by the coders appear in about 37% of the records, while the 100th most frequent ICD code is present in only 2% of the records.
So the usage of codes in these documents is not balanced. Our objective here is to construct a model based on machine learning methods for automatic extraction of these codes from the discharge summaries. And here we want to explore multi-label classification:
given a document, we want to be able to assign more than one ICD code to it. To this end, we investigate support vector machines for this task. More specifically, our contribution here
is to investigate how the parameters of this machine learning algorithm play a role in the precision of the code identification, and also how to handle the imbalance of classes
via class weighting. For this kind of extraction of codes from electronic records, there are some works in the literature which have achieved nice results.
But some of them target specific applications. We also have other work using hierarchy-based support vector machines, which is one of the approaches we compare against in this work.
Other works try to explore different text representations to create the models, obtaining competitive F1 scores.
And there are others which improve these results a little using deep learning approaches, but which only use some parts of the discharge summaries, like the final discharge description. In our work, we are going to use the whole text
of the discharge summaries, and we are going to investigate how the parameters in SVM play an important role in optimizing and getting better results for this task. We consider the MIMIC-III database.
This database is openly available from a hospital in Boston, with information regarding intensive care units. In the beginning we had about 55,000 discharge
summaries and almost 700 different diagnostic codes, but we considered only the 100 most frequent diagnostics; in our investigation we didn't consider the whole set of
diagnosis codes available. We selected the discharge summaries that had at least one of these 100 most frequent diagnostics, and in the end we ended up with 53,000 discharge
summaries, which is the input for our work. From this, we create a vector which is our ground truth: for each discharge summary we are considering,
we have the 100 ICD-9 codes, with a one or zero (true or false) indicating which ones are assigned in this database to that discharge summary. We use this as our ground truth.
In our experimental design, we have as input this collection from the database, with 80% for training the models and 20% for testing them.
This is our procedure to split the data, and we did the split in a stratified manner, so that we keep the proportion of the classes in both sets. Given this collection,
our first step was text pre-processing, in which we removed stop words (empty words), applied lemmatization to get the root of the words,
and removed numbers and other special characters. After this pre-processing comes feature extraction, in which we create a vocabulary, a vector of the terms extracted
from the texts, weighted with TF-IDF. We removed the most frequent terms and also those which are not frequent,
because they don't help in the construction of the model. In the end, our feature vector contains more than 12,000 tokens extracted from the documents.
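As a rough sketch of this feature-extraction step (pure Python; the `min_df` and `max_df_ratio` thresholds are hypothetical stand-ins for the pruning of rare and very frequent terms described here):

```python
import math
from collections import Counter

def tfidf_features(docs, min_df=2, max_df_ratio=0.9):
    """Build a TF-IDF matrix over a pruned vocabulary: terms appearing in
    fewer than min_df documents, or in more than max_df_ratio of them,
    are dropped, mirroring the vocabulary pruning described above."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # document frequency of each term
    df = Counter(t for toks in tokenized for t in set(toks))
    vocab = sorted(t for t, d in df.items() if d >= min_df and d / n <= max_df_ratio)
    index = {t: i for i, t in enumerate(vocab)}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        row = [0.0] * len(vocab)
        for t, c in tf.items():
            if t in index:
                # term frequency times inverse document frequency
                row[index[t]] = (c / len(toks)) * math.log(n / df[t])
        rows.append(row)
    return vocab, rows
```

In the paper the resulting vectors have over 12,000 dimensions; the sketch only shows the pruning and weighting mechanics.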
Having created the feature vectors, we applied the classification model construction, which is SVM. But since this is multi-label classification
with 100 possible ICD classes, our approach was to create one classifier for each class, using binary relevance: in the end we have 100 classifiers, and for each document we check, per classifier,
whether its code applies or not. To create these 100 classifiers, we built a process to optimize the parameters of the SVM.
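The one-classifier-per-code scheme (binary relevance) can be sketched as follows; `MajorityClf` is a hypothetical toy stand-in for the per-class SVMs used in the paper, just to keep the sketch self-contained:

```python
class MajorityClf:
    """Toy binary classifier that always predicts the majority class it saw
    during training. Any object with fit/predict would do, e.g. an SVM."""
    def fit(self, X, y):
        self.label = int(sum(y) * 2 >= len(y))
        return self
    def predict(self, X):
        return [self.label] * len(X)

class BinaryRelevance:
    """Multi-label classification via binary relevance: train one
    independent binary classifier per label (here, per ICD code)."""
    def __init__(self, make_clf, n_labels):
        self.clfs = [make_clf() for _ in range(n_labels)]
    def fit(self, X, Y):
        # Y is an n_docs x n_labels 0/1 matrix; column j supervises classifier j
        for j, clf in enumerate(self.clfs):
            clf.fit(X, [row[j] for row in Y])
        return self
    def predict(self, X):
        # stack per-classifier predictions back into label vectors
        cols = [clf.predict(X) for clf in self.clfs]
        return [[col[i] for col in cols] for i in range(len(X))]
```

With 100 ICD codes, `BinaryRelevance` would hold 100 classifiers, and a document's prediction is the 100-dimensional 0/1 vector of their outputs.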
To this end, we took 30% of the first training set and used these documents for parameter value searching: based on this
subset, we train the SVM and optimize its parameters. In the SVM we have three main parameters:
the kernel, which can be linear or non-linear; the parameter C, which controls the margin of the classifier and thus the behavior of the algorithm; and gamma, which only applies to the RBF kernel.
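For reference, the two kernel families mentioned, sketched in plain Python (the gamma default is just an illustrative choice, not the paper's value):

```python
import math

def linear_kernel(x, y):
    """Linear kernel: the plain dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma=0.1):
    """RBF (Gaussian) kernel: exp(-gamma * ||x - y||^2).
    gamma controls how quickly similarity decays with distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```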
On this subset, we created a variation of the values of these parameters to see, for each of the classifiers we need, which parameters work best
for that model. Also, to handle the class imbalance, we use class weighting: starting from an initial weight, we made some modifications to the initial formula
to better handle the class weighting. Since some classes have lots of examples and other classes have very few, we modified this initial weight to weight the classes better,
so that in the end the classes which don't have many examples are not penalized. Based on this, we retrain for each class on this second training set.
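The exact modified formula is the authors'; as a hedged illustration, the standard inverse-frequency ("balanced") heuristic that such schemes typically start from can be written as:

```python
def balanced_weights(y):
    """Standard inverse-frequency class weights for a binary label vector:
    w_c = n_samples / (n_classes * n_c), so the rare class gets a larger
    weight. The talk's modified formula refines this kind of starting point."""
    n = len(y)
    pos = sum(y)
    neg = n - pos
    return {0: n / (2 * neg), 1: n / (2 * pos)}
```

For a code present in only 1 of 4 documents, the positive class gets weight 2.0 and the negative class 0.67, so errors on the rare class cost more during training.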
The performance of each parameter combination was tested on the validation set, and once we found the best parameters for each classifier, we created the classifier based on the first training set.
This part was only to figure out which are the best parameters for the classifiers; we then used the first training set to create the models based on these best parameters.
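The parameter search loop can be sketched generically; `train` and `validate` are hypothetical callbacks standing in for SVM training on the tuning subset and F1 scoring on the validation set:

```python
from itertools import product

def grid_search(train, validate, param_grid):
    """Exhaustive grid search: fit a model for every combination of the
    parameter values and keep the combination scoring best on validation."""
    best_score, best_params = float("-inf"), None
    keys = sorted(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = validate(train(params))
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```

In the paper this loop would run once per ICD code, over a grid of kernel, C, and gamma values, so each of the 100 classifiers gets its own best parameters.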
Also, to have a kind of baseline configuration, we created the models in a configuration where we arbitrarily used the linear kernel and C equal to one, just to have a baseline to compare with our approach
of searching for the better parameters. Here we have the results for the baseline. Since we have 100 classes, this is the mean of the results over the classes:
for F1 macro we got about 33%, with a somewhat high standard deviation. With our approach, in which we detected the best parameters
for each of the classes, we improved this value and also decreased the standard deviation. If you look at the worst results, for the 20 worst classes
without parameter optimization, we see that they are not good, and they improve a little with parameter optimization. But these classes present the worst effectiveness:
they correspond to those which are the most imbalanced, which is a hard problem to solve. And here we have the results for the five best classes: for each code we show the F-measure, for which we got quite good results.
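The macro-averaged F1 reported throughout treats every class equally, which is why the rare, imbalanced codes pull the average down; a minimal sketch of the metric:

```python
def f1_macro(y_true_cols, y_pred_cols):
    """Macro-averaged F1: compute F1 per class from its true positives,
    false positives, and false negatives, then take the unweighted mean,
    so rare ICD codes count as much as frequent ones."""
    scores = []
    for yt, yp in zip(y_true_cols, y_pred_cols):
        tp = sum(1 for t, p in zip(yt, yp) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(yt, yp) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(yt, yp) if t == 1 and p == 0)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)
```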
One interesting analysis we conducted concerns what happens when we increase the number of classes considered: with the 100 classes we got an F1 macro of 49%, and as we increase the number of classes, the F1 macro decreases;
the more classes we consider, the lower the results. Some key findings of this contribution: we found that it's important to perform this parameter
optimization. Most of the works in this line only use, for instance, the linear kernel, but we found that for most of the classes we got better results using the RBF kernel, which
is surprising and interesting to know. We also saw that our strategy of using class weighting improved the results, and only two classes performed better without class weighting. As future work, we want to use
the structure of UMLS concepts in the text pre-processing: for instance, instead of using all 12,000 terms for feature extraction, only use those terms
that we detect in UMLS, and see how that works. We also want to understand the role of diagnostic correlation in the prediction of ICD codes. Some ICD codes are highly correlated with each other:
for instance, if a document contains the ICD-9 code for hypertensive disease, it is highly correlated with heart disease. How can we take this into account in the identification of the codes? Well, the take-home message.
In this work, our goal was to create automatic coding from the free text of electronic records, more specifically from discharge summaries. We investigated the use of support vector machines and how to deal with the optimization of their parameters,
and we also proposed how to handle multi-label classification in this task. Finally, our study has shown that parameter value search and the use of class weighting can bring improvements to the automatic coding task,
which is the main contribution of this work. I acknowledge some colleagues in computing and FAPESP for the grant for this project. And thank you so much for your attention.