Colomb Medica · Jan 2023

Automated extraction of information from free text of Spanish oncology pathology reports.

- Diana Marcela Mendoza-Urbano, Johan Felipe Garcia, Juan Sebastian Moreno, Juan Carlos Bravo-Ocaña, Alvaro José Riascos, Angela Zambrano Harvey, and Sergio I Prada.
- Universidad Nacional de Colombia, Facultad de Medicina, Departamento de Patología, Bogotá, Colombia.
- Colomb Medica. 2023 Jan 1; 54 (1): e2035300.
Background: Pathology reports are stored as unstructured, ungrammatical, fragmented, and abbreviated free text, with linguistic variability among pathologists. For this reason, extracting tumor information requires significant human effort. Recording data in an efficient, high-quality format is essential to implementing and establishing a hospital-based cancer registry.

Objective: This study aimed to describe the implementation of a natural language processing algorithm for oncology pathology reports.

Methods: An algorithm was developed to process oncology pathology reports in Spanish and extract 20 medical descriptors. The approach is based on the successive matching of regular expressions.

Results: The validation was performed on 140 pathology reports. Topography was identified in all reports, both manually by human reviewers and by the algorithm. Morphology was identified by the human reviewers in 138 reports and by the algorithm in 137. The average fuzzy matching score was 68.3 for topography and 89.5 for morphology.

Conclusions: A preliminary validation of the algorithm against human extraction was performed on a small set of reports, with satisfactory results. This shows that a regular-expression approach can accurately and precisely extract multiple specimen attributes from free-text Spanish pathology reports. Additionally, we developed a website to facilitate collaborative validation at a larger scale, which may be helpful for future research on the subject.

Copyright © 2023 Colombia Medica.
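
To make the idea of successive regular-expression matching more concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes that each descriptor has an ordered list of patterns, tried from most specific to most general, and that the first match wins. The patterns, the field names, the `extract` and `fuzzy_score` functions, and the sample report are all hypothetical. The abstract does not say which fuzzy matching metric was used, so the score here is simply the standard-library `difflib` similarity ratio scaled to 0-100.

```python
import re
from difflib import SequenceMatcher

# Hypothetical patterns (not taken from the paper). Each descriptor maps to an
# ordered list of regexes, tried from most specific to most general.
PATTERNS = {
    "topography": [
        # e.g. "Biopsia de mama izquierda." -> "mama izquierda"
        re.compile(
            r"(?:biopsia|resecci[oó]n|espec[ií]men)\s+de\s+"
            r"(?P<value>[a-záéíóúñ ]+?)(?=[.,:;]|$)",
            re.I,
        ),
        # Fallback: a bare organ name anywhere in the text.
        re.compile(r"\b(?P<value>mama|colon|pr[oó]stata|pulm[oó]n|est[oó]mago)\b", re.I),
    ],
    "morphology": [
        # e.g. "carcinoma ductal infiltrante" (up to three qualifying words).
        re.compile(r"\b(?P<value>(?:adeno)?carcinoma(?:\s+[a-záéíóúñ]+){0,3})", re.I),
        # Fallback: other tumor types named on their own.
        re.compile(r"\b(?P<value>linfoma|sarcoma|melanoma)\b", re.I),
    ],
}


def extract(report):
    """Return the first match per descriptor, or None if no pattern matches."""
    result = {}
    for field, patterns in PATTERNS.items():
        result[field] = None
        for pattern in patterns:
            match = pattern.search(report)
            if match:
                result[field] = match.group("value").strip().lower()
                break  # stop at the first (most specific) pattern that matches
    return result


def fuzzy_score(extracted, reference):
    """Similarity on a 0-100 scale (assumed metric: difflib ratio)."""
    return 100.0 * SequenceMatcher(None, extracted.lower(), reference.lower()).ratio()


if __name__ == "__main__":
    report = "Biopsia de mama izquierda. Diagnóstico: carcinoma ductal infiltrante."
    fields = extract(report)
    print(fields)
    # Compare the algorithm's output with a (hypothetical) human annotation.
    print(fuzzy_score(fields["topography"], "mama izquierda"))
```

Ordering the patterns from specific to general keeps a context-rich match from being overridden by a bare keyword, and a descriptor left as `None` corresponds to a report in which the algorithm identified nothing for that field.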
