Data and text mining of cancer symptoms and comorbidities in electronic patient records in the Nordic languages, MINECAN

Project Leader: Hercules Dalianis

PhD Student: Rebecka Weegar

The NIASC work plan proposes a number of E-science tools, such as C1. Patient record text mining applied for research on screening and B1. Nordic biobank registry. Our proposal addresses these tools by using text mining to convert unstructured information into structured information. Our results will also address the requirements for open E-science tools. The proposal connects data and text mining techniques from Sweden and Denmark with stakeholders in Sweden, Denmark and Norway.
The PhD project is split into a number of subprojects. The first subproject concerns cervical cancer, while the second targets prostate cancer. These two subprojects constitute the main focus of the entire PhD project.
Cervical cancer (ICD-10 code C53) develops in the tissues of the cervix and is, in the majority of cases, caused by human papillomavirus (HPV) infection. Cervical cancer is difficult to diagnose early because its symptoms are vague. By the time women notice symptoms, the cancer has often already spread. One goal of this subproject is to apply text mining methods to identify early symptoms of cervical cancer, whether previously known or unknown. With the help of such an established symptom spectrum, cytology screening could be complemented by questionnaires in order to strengthen the diagnosis.
Prostate cancer (ICD-10 code C61) is the most common cancer diagnosis. In 2009, it was the second most common cause of cancer death in the Nordic countries. Because it accounts for 40% of prevalent male cancers in the region, its health burden and costs are substantial. Prostate biopsies, in which samples of tissue are removed from the prostate, are associated with adverse outcomes and substantial costs. The goal of this subproject is to identify comorbidities, i.e., additional disorders or diseases that co-occur with a primary disease or disorder, that are associated with biopsies.
The third subproject targets the conversion of free-text pathology referrals into a structured format for use in registers, using text mining and machine learning techniques.