The Impact of De-identification on Downstream Named Entity Recognition in Clinical Text
- Berg, Hanna (author)
- Stockholm University, Department of Computer and Systems Sciences
- Henriksson, Aron (author)
- Stockholm University, Department of Computer and Systems Sciences
- Dalianis, Hercules (author)
- Stockholm University, Department of Computer and Systems Sciences
- USA : Association for Computational Linguistics, 2020
- 2020
- English.
In: The 11th International Workshop on Health Text Mining and Information Analysis LOUHI 2020. - USA : Association for Computational Linguistics. - ISBN 9781952148811, pp. 1-11
- Related links:
- https://doi.org/10.1...
- https://su.diva-port... (primary) (Raw object)
- https://www.aclweb.o...
- https://urn.kb.se/re...
- https://doi.org/10.1...
Abstract
- The impact of de-identification on data quality and, in particular, utility for developing models for downstream tasks has been more thoroughly studied for structured data than for unstructured text. While previous studies indicate that text de-identification has a limited impact on models for downstream tasks, it remains unclear what the impact is with various levels and forms of de-identification, in particular concerning the trade-off between precision and recall. In this paper, the impact of de-identification is studied on downstream named entity recognition in Swedish clinical text. The results indicate that de-identification models with moderate to high precision lead to similar downstream performance, while low precision has a substantial negative impact. Furthermore, different strategies for concealing sensitive information affect performance to different degrees, ranging from pseudonymisation having a low impact to the removal of entire sentences with sensitive information having a high impact. This study indicates that it is possible to increase the recall of models for identifying sensitive information without negatively affecting the use of de-identified text data for training models for clinical named entity recognition; however, there is ultimately a trade-off between the level of de-identification and the subsequent utility of the data.
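The concealment strategies contrasted in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `(start, end, label)` span format, the class-label masking variant, the surrogate name and the naive sentence splitter are all assumptions made for illustration. It shows why pseudonymisation leaves most of the surrounding text intact for a downstream model, while sentence removal discards everything in a flagged sentence.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical span format: (start offset, end offset, entity label)
# as a de-identification model might output it.
Span = Tuple[int, int, str]


def replace_spans(text: str, spans: List[Span],
                  replacement_for: Callable[[str], str]) -> str:
    """Replace each detected span, working right to left so offsets stay valid."""
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + replacement_for(label) + text[end:]
    return text


def mask(text: str, spans: List[Span]) -> str:
    """Replace each sensitive span with its class label (assumed variant)."""
    return replace_spans(text, spans, lambda label: f"[{label}]")


def pseudonymise(text: str, spans: List[Span],
                 surrogates: Dict[str, str]) -> str:
    """Replace each sensitive span with a surrogate value of the same class."""
    return replace_spans(text, spans, lambda label: surrogates[label])


def remove_sentences(text: str, spans: List[Span]) -> str:
    """Drop every sentence that overlaps a sensitive span (naive splitter)."""
    kept = []
    for match in re.finditer(r"[^.!?]+[.!?]?\s*", text):
        overlaps = any(s < match.end() and e > match.start()
                       for s, e, _ in spans)
        if not overlaps:
            kept.append(match.group())
    return "".join(kept).strip()


# Made-up example note and a single detected name span.
note = "Patient Eva Lind was admitted. Blood pressure was normal."
detected = [(8, 16, "NAME")]

masked = mask(note, detected)
pseudonymised = pseudonymise(note, detected, {"NAME": "Maria Holm"})
reduced = remove_sentences(note, detected)
```

In this sketch a false positive from a low-precision de-identification model would show up as an extra entry in `detected`, so more of the original text would be altered or dropped before it reaches the downstream named entity recognition model.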
Subject terms
- NATURAL SCIENCES -- Computer and Information Sciences -- Language Technology (hsv//eng)
Keywords
- Computer and Systems Sciences
Publication and content type
- ref (subject category)
- kon (subject category)