%% The first command in your LaTeX source must be the \documentclass command.
%%
%% Options:
%% twocolumn : Two column layout.
\documentclass[
% twocolumn,
]{ceurart}
%%
%% One can fix some overfulls
% \sloppy
\usepackage{todonotes}
\usepackage{microtype}
\usepackage{float}
%%
%% end of the preamble, start of the body of the document source.
\begin{document}
%%
%% Rights management information.
%% CC-BY is default license. Do not change!
\copyrightyear{2021}
\copyrightclause{Copyright for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0).}
%%
%% This command is for the conference information
%\conference{IberLEF 2020, September 2020, Málaga, Spain}
\conference{Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021)}
%%
%% The "title" command
\title{Codestrange at eHealth-KD Challenge 2021}
%%
%% The "author" command and its associated commands are used to define
%% the authors and their affiliations.
\author[1]{Roberto Marti}[%
orcid=0000-0002-7671-4187,
]
\author[1]{Carlos Bermudez}[%
orcid=0000-0002-5267-0327,
]
\author[1]{Leonel García}[%
orcid=0000-0001-9916-2103,
]
\author[1]{Leynier Gutierrez}[%
orcid=0000-0001-6393-1679,
]
\address[1]{University of Havana, Havana, Cuba}
%%
%% The abstract is a short summary of the work to be presented in the
%% article.
\begin{abstract}
This paper describes the participation of the Codestrange team in the eHealth-KD Challenge 2021. The team focused on Task A, entity recognition, building its system on the spaCy toolkit: the Spanish \texttt{es\_core\_news\_lg} pipeline was used as a base, and its named entity recognition component was replaced by a custom one trained on the challenge corpora. The raw model output was refined with three post-processing steps: merging contiguous entities of the same type into multi-word entities, removing stopwords wrongly labeled as Concepts, and removing numbers recognized as entities. The system attempted Scenario 2 of the challenge, reaching a precision of 0.415 but a low recall, which resulted in a low overall F1 score. Possible improvements include pretrained embeddings from the health domain and a model that requires less output post-processing.
\end{abstract}
%%
%% Keywords. The author(s) should pick words that accurately describe
%% the work being presented. Separate the keywords with commas.
\begin{keywords}
eHealth \sep
Knowledge Discovery \sep
Natural Language Processing \sep
Machine Learning
\end{keywords}
%%
%% This command processes the author and affiliation and title
%% information and builds the first part of the formatted document.
\maketitle
% \setcounter{page}{XX} We will let you know which page counter to use for the camera ready version
\pagestyle{plain}
\section{Introduction}
This paper presents the Codestrange team's approach to the eHealth-KD Challenge 2021~\cite{overview_ehealthkd2021}. The main focus was Task A, which addresses entity recognition. For this task the team used the spaCy~\cite{spacy} toolkit, since it provides pretrained models that already include a pipeline for tokenization, parsing, sentence recognition, entity recognition, and related steps. The entity recognition step of this pipeline was the basis of the team's approach to Task A.
\section{System Description}
\begin{itemize}
\item Architecture \newline
As a base we used the spaCy pipeline \texttt{es\_core\_news\_lg}, the large pipeline trained for Spanish. In this pipeline we replaced the \texttt{ner} component with a custom model that uses the entity labels provided in the challenge datasets. The dropout rate was set to $0.28$.
\item Input handling \newline
The input was given to the model as-is; no additional preprocessing was applied.
\item Output handling \newline
The output of the pipeline was the set of recognized entities. On this output we performed three main post-processing operations, sketched in code after this list:
\begin{itemize}
\item Multi-word entities \newline
One drawback of the model was that the entities produced by the \texttt{ner} component consisted of a single word each. Concepts such as ``c\'ancer de pulm\'on'' were split into three separate entities. We addressed this by merging contiguous entities that were assigned the same entity type by the model.
\item Stopword removal \newline
Another issue was that Spanish stopwords were in many cases labeled as Concepts even when they were not part of an actual Concept. To fix this, whenever the model labeled a stopword as a Concept and that stopword was not part of a larger Concept (as in the previous case), the entity was removed from the output.
\item Number removal \newline
Finally, the model sometimes recognized numbers in the text as entities. These cases were also removed from the output.
\end{itemize}
\item System training \newline
The model was trained on the Google Colab~\cite{GoogleColab} platform, on a Python 3 Google Compute Engine backend under the free plan, with the results stored in Google Drive.
The team used the training corpus for training and the development corpus for validation; cross-validation was performed with spaCy. The sandbox server was also used to validate results before the official submission. A simplified training sketch is shown after this list.
\item Pretrained word embeddings \newline
The spaCy pipeline components we used were trained on news data.
\end{itemize}
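The following sketch illustrates how such a pipeline can be fine-tuned with the dropout rate described above. It is a minimal illustration using the spaCy v3 training API, not the team's exact script: the team replaced the \texttt{ner} component with a custom one, whereas this sketch simply resumes training of the existing component, and the training example and the \texttt{Concept} label are placeholder assumptions.
\begin{verbatim}
import random

import spacy
from spacy.training import Example

# Placeholder training data; the real corpora come from the
# challenge. Offsets and the "Concept" label are assumptions.
TRAIN_DATA = [
    ("El cancer de pulmon es frecuente.",
     {"entities": [(3, 19, "Concept")]}),
]

nlp = spacy.load("es_core_news_lg")
ner = nlp.get_pipe("ner")
for _, annotations in TRAIN_DATA:
    for _, _, label in annotations["entities"]:
        ner.add_label(label)

optimizer = nlp.resume_training()
frozen = [name for name in nlp.pipe_names if name != "ner"]
with nlp.select_pipes(disable=frozen):
    for epoch in range(20):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            example = Example.from_dict(nlp.make_doc(text),
                                        annotations)
            nlp.update([example], drop=0.28, sgd=optimizer,
                       losses=losses)
\end{verbatim}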
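The three output-handling operations can likewise be summarized as a short post-processing function over a parsed spaCy document. This is a sketch under stated assumptions: spaCy's \texttt{is\_stop} and \texttt{like\_num} token attributes stand in for whatever stopword list and number detection the team actually used.
\begin{verbatim}
from spacy.tokens import Span

def postprocess(doc):
    """Apply the three output-handling steps to a parsed doc."""
    merged = []
    for ent in doc.ents:
        # 1. Merge contiguous entities that share a label into one
        #    multi-word entity ("cancer de pulmon" as one Concept).
        if (merged and merged[-1].label_ == ent.label_
                and merged[-1].end == ent.start):
            merged[-1] = Span(doc, merged[-1].start, ent.end,
                              label=ent.label_)
        else:
            merged.append(ent)
    # 2. and 3. Drop remaining single-token entities that are
    #    stopwords or numbers.
    return [ent for ent in merged
            if not (len(ent) == 1
                    and (ent[0].is_stop or ent[0].like_num))]

# Usage, assuming `nlp` is the fine-tuned pipeline from above:
#     entities = postprocess(nlp("El cancer de pulmon es raro."))
\end{verbatim}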
\section{Results}
The model obtained the following results:
\begin{table}[H]
  \caption{Official results of the Codestrange system per scenario.}
  \label{tab:results}
  \begin{tabular}{lccc}
    \hline
    Scenario & F1 & Precision & Recall \\
    \hline
    Scenario 1 & 0.23201 & 0.33703 & 0.17689 \\
    Scenario 2 & 0.08019 & 0.41500 & 0.04439 \\
    Scenario 3 & 0.03275 & 0.43750 & 0.01701 \\
    \hline
  \end{tabular}
\end{table}
Since Scenario 2 was the only scenario attempted, the scores for the other scenarios correspond to the baseline model. In Scenario 2 we achieved a precision of $0.415$, in some cases higher than that of other teams, but the many concepts missing from the output, reflected in the low recall, caused the low overall score.
\section{Conclusions}
The team concludes that a better entity recognition model is needed in order to recognize more entities and thus improve the low recall scores. This could perhaps be achieved by using pretrained embeddings such as BERT, or embeddings related to the health domain. Future work should also aim for a model that requires fewer, or at least simpler, output post-processing steps. We also consider that attempting Task B may be necessary for an overall improvement of the system's score.
%%
%% Define the bibliography file to be used
\bibliography{sample-ceur}
\end{document}
%%
%% End of file