ISSN: 2155-9570
Research Article - (2016) Volume 7, Issue 4
Objective: Currently about 1/12 of the world’s population has diabetes mellitus (DM), and many of these people are or will be screened for diabetic retinopathy (DR) by having retinal images taken. This study compares the ability of the DAPHNE software to detect DR in three different European populations with human grading carried out at the Moorfields Eye Hospital Reading Centre (MEHRC).
Participants: Retinal images were taken from participants of the HAPIEE study (Lithuania, n=1014), the PAMDI study (Italy, n=882) and the MARS study (Germany, n=909).
Methods: All anonymized images were graded by human graders at MEHRC for the presence of DR. Independently, and without any knowledge of the human graders’ results, the DAPHNE software analysed the images and divided the participants into DR and no-DR groups.
Main outcome measures: The primary outcomes were sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) of the DAPHNE software with regard to the identification of DR or no-DR on retinal images, with the human grader as the reference standard.
Results: A total of 2805 participants were enrolled from the three study sites. In all three studies the sensitivity of the DAPHNE software was above 93%, the specificity was above 80%, the PPV was above 28% and the NPV was not below 98.8%. The DAPHNE software did not miss any vision-threatening DR. The areas under the curve (AUC) for all three studies were above 0.96. DAPHNE reduced manual human workload by 70% but had a total false positive rate of 63%.
Conclusions: The DAPHNE software proved reliable in detecting DR in three different European populations imaged with three different imaging settings. Further testing is required to assess scalability, performance in live DR screening systems and performance with camera settings different from those used in these studies.
Keywords: Diabetic retinopathy; Automated grading; Europe; Diabetic retinopathy screening; Diabetes
Worldwide, the number of patients with diabetes mellitus (DM) is expected to increase from 177 million in 2000 to 366 million by 2030 [1], while the International Diabetes Federation (IDF) estimates as many as 592 million with DM by 2035. Currently, about 1/12 of the world’s population has DM [2]. Detection of diabetic retinopathy (DR) relies on detailed examination of the retina, either by slit-lamp examination or by analysis of fundus images. The latter is usually done by human graders, whether in population-based studies or in DR screening programmes. As imaging volumes increase, it may not be possible to keep up with grading demands, and new solutions such as automated image analysis might be a viable way to analyse image sets safely and effectively. In many countries the current recommendation is to screen patients with DM for DR annually. In the publicly funded UK system about 80% of eligible DM patients are screened annually [3], and as a result of the screening programme DR is no longer the leading cause of blindness in the working-age population [4]. Population-based studies also generate large image sets, and the analysis of these is crucial for understanding trends in disease detection and progression and for informing policy in the relevant countries. With the advent of portable imaging technologies, the amount of retinal imaging generated is likely to increase exponentially. Automated detection software could be one way to minimise the impact of this increasing workload, allowing healthcare providers to focus on those patients requiring treatment.
Automated detection software has proved reliable in earlier studies [5,6], but there is a need to establish whether such software works equally well in different populations and on different imaging modalities. Earlier studies mainly focused on smaller samples from one population group [7-9]. There is therefore a need for further testing and analysis on larger samples and different population groups, with the results compared to human graders’ results on the same images. In DR screening programmes about two thirds of patients have no DR, and grading the images from these patients takes attention away from those requiring quick turnaround in grading and timely referral. It is therefore imperative that any automated software has an excellent and very reliable ability to detect no DR, so that no or minimal re-grading by human graders is required for images with no DR. Our aim in this study was to test the ability of the automated software (named DAPHNE [10]) to detect or rule out DR in three population samples from different parts of Europe. To this end, we used three population-based studies to test DAPHNE’s ability to differentiate normal fundus images from those with DR.
Subject recruitment
Health, alcohol and psychosocial factors in eastern Europe (HAPIEE): Participants of the population-based study were 45-72 year-old residents of the second largest city in Lithuania (Kaunas), randomly drawn from the population register of the city (population of 352,000 residents). The original study was part of a prospective cohort study on Health, Alcohol and Psychosocial Factors in Eastern Europe (HAPIEE) [11]. A total of 7,087 individuals participated. The subjects for ophthalmological examination were randomly drawn from the main study and included a total of 1,065 participants; 1,014 participants (2024 eyes) had retinal images taken and these were used for this study. The camera used in this study was a Canon CF-60Uvi (Canon Medical Systems, USA). Informed consent was obtained from each participant. The study was approved by the regional ethics committee and was carried out in accordance with the Declaration of Helsinki [12].
The prevalence of age-related macular degeneration in Italy (PAMDI): The study population has been described in detail elsewhere [13], but a short summary is given below. PAMDI examined 1,162 participants between 31 Oct 2005 and 31 Oct 2006 in two communities in North-East Italy: one living in an urban district and the other living in two adjacent small rural communities. Inhabitants aged 60 years or older were invited to participate in the study. A total of 885 participants underwent full ophthalmic examination. At the examination, 30° colour fundus photographs were taken of the posterior retina of both eyes according to standard methods (ETDRS field 1M) using a digital fundus camera (Topcon, Topcon Co., Tokyo, Japan). In our study all 882 participants with gradable images were used. The study was approved by the University of Padua Human Research Ethics Committee and was performed in accordance with the tenets of the Declaration of Helsinki [13].
Muenster aging and retina study (MARS): MARS is a longitudinal study designed to identify medical, environmental and genetic factors with implications for the pathogenesis and progression of AMD. A total of 1,060 study participants were examined at baseline (June 2001-October 2003, MARS-I) and during the first follow-up examination (November 2004-September 2006, MARS-II). All participants were of Caucasian origin. In this study all patients with gradable images (a total of 909 participants) were used. Stereoscopic digital colour photographs (30°) were taken of both eyes, centred on the fovea (ZEISS FF 450 fundus camera, ZEISS Oberkochen, Germany) [14]. The recruitment and research protocols were reviewed and approved by the Institutional Review Board of the University of Münster, and written informed consent was obtained from all study participants, in compliance with the Declaration of Helsinki [14,15].
Diabetic retinopathy grading
The human grading was performed by trained and certified graders at MEHRC, London, UK. The results from the human graders were used as the reference standard. All grading was based on the International Clinical Diabetic Retinopathy (ICDR) and Diabetic Macular Edema Disease Severity scale [16]. Each patient was categorised as no apparent DR (DR 0), mild non-proliferative DR (NPDR) (DR 1) with microaneurysm(s) only, moderate NPDR (DR 2) with more changes than mild NPDR but less than severe NPDR, severe NPDR (DR 3) or proliferative DR (PDR) (DR 4). Severe NPDR is defined by the ETDRS 4:2:1 rule, i.e. any of the following with no signs of PDR: >20 intraretinal haemorrhages in each of the 4 quadrants, definite venous beading in ≥ 2 quadrants, or prominent intra-retinal microvascular abnormalities (IRMA) in ≥ 1 quadrant. PDR is present where neovascularization and/or vitreous or preretinal haemorrhage is seen [16]. PDR was regarded as sight-threatening disease.
DAPHNE automated software
The current system separates images into disease/no-disease states by detecting any potential DR lesions on the retinal image. The automatic DR system is divided into image quality filtering, detection of anatomical structures (including vessels, macula and optic disc), detection of lesions (such as white lesions, haemorrhages and microaneurysms) and detection of artefacts [10]. The flowchart for the system is shown in Figure 1.
Main input of the current system
The green channel of retinal images has been used in this current method as it often shows higher contrast between dark objects and their surrounding background.
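As a minimal illustration of this input step, the sketch below extracts the green channel from an image held as rows of (R, G, B) tuples. The function name `green_channel` and the pixel values are hypothetical and chosen only for illustration; they are not taken from the DAPHNE implementation.

```python
def green_channel(rgb_image):
    """Return the green channel of an image stored as rows of (R, G, B) tuples."""
    return [[pixel[1] for pixel in row] for row in rgb_image]


# Hypothetical 2x2 image: one darker pixel (bottom left) on a brighter background.
image = [
    [(200, 120, 40), (198, 118, 42)],
    [(150, 40, 30), (201, 121, 39)],
]

green = green_channel(image)
print(green)
```

All later processing steps would then operate on this single-channel image rather than on the full colour image.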
Blood vessel tracking
As retinal blood vessels are the most stable features on retinal images, reliable vessel segmentation is the prerequisite for good quality automatic retinal image analysis. The vessel centrelines are therefore first extracted through singular spectrum analysis [17]. Each detected centreline point is regarded as an initial point from which the vessel network is generated by measuring local information. Finally, a trace-back mechanism is used to reconstruct the network of arteries and veins, as shown in Figure 2 [10].
Image quality filtering
Measuring image quality is fundamental for automated systems, as it determines whether image quality is sufficient for grading by automated analysis. The proposed system relies on the assumption that an image of normal quality should include sufficient retinal landmarks such as the vessel network. In this stage, a polar coordinate system is employed to define a number of local windows. The vessel distribution is obtained by counting the vessel centrelines in each local window. The vessel centrelines are extracted in the vessel tracking stage; an example is shown in Figure 3. The number of vessel centrelines in each local window is used as a feature to classify the quality level of the retinal image (gradable or ungradable).
Sample images with quality levels graded by human experts were first obtained. Their vessel measurements, obtained as described above, were used for training together with the grading levels. Once the system is trained, any previously unseen image goes through the same procedure: its vessel structure is extracted and its quality level is then predicted by the automated system. After quality filtering the grading can take place [10].
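The window-based feature described above can be sketched as follows. This is only an illustration under stated assumptions: the number of rings and sectors, the function name `polar_window_counts` and the example points are hypothetical, and the classifier that turns the counts into a gradable/ungradable decision is not shown because its details are not given here.

```python
import math


def polar_window_counts(centreline_points, centre, max_radius, n_rings=3, n_sectors=4):
    """Count vessel centreline points in each (ring, sector) window of a polar grid."""
    counts = [[0] * n_sectors for _ in range(n_rings)]
    cx, cy = centre
    for x, y in centreline_points:
        dx, dy = x - cx, y - cy
        r = math.hypot(dx, dy)
        if r >= max_radius:
            continue  # point lies outside the assumed field of view
        ring = int(r / max_radius * n_rings)
        angle = math.atan2(dy, dx) % (2 * math.pi)
        sector = int(angle / (2 * math.pi) * n_sectors) % n_sectors
        counts[ring][sector] += 1
    return counts


# Hypothetical centreline points, given relative to an image centre at (0, 0).
points = [(10, 5), (-5, 10), (-50, -20), (60, -60), (200, 200)]
counts = polar_window_counts(points, centre=(0, 0), max_radius=90)

# Flatten the per-window counts into one feature vector for a quality classifier.
features = [n for ring in counts for n in ring]
print(features)
```

An image with few or no centreline points in many windows would yield a sparse feature vector, which is the kind of pattern a trained classifier could associate with insufficient quality.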
Optic disc and macula detection
Our approach to locating the optic disc (OD) uses the vessel network, as the main retinal vessels converge on the OD.
The candidate region of the macula is defined as a circular area located about 2 disc diameters (2DD) temporal to the optic disc in the retinal image, as shown in Figure 4. These anatomical constraints are therefore used to search for a candidate macula region.
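The anatomical constraint above can be written down as a small geometric sketch. The 2DD offset comes from the text; the radius of the search circle (`radius_dd`), the `temporal_sign` parameter encoding which horizontal direction is temporal in the image, and the function name are assumptions made for illustration only.

```python
def macula_candidate(od_centre, od_diameter, temporal_sign, offset_dd=2.0, radius_dd=1.0):
    """Return (centre, radius) of a circular macula search region.

    temporal_sign is -1 if the temporal side is towards smaller x in the image
    and +1 if it is towards larger x (this depends on eye laterality and on
    image orientation, so it is left to the caller).
    radius_dd is a hypothetical search radius in disc diameters.
    """
    x, y = od_centre
    centre = (x + temporal_sign * offset_dd * od_diameter, y)
    radius = radius_dd * od_diameter
    return centre, radius


# Hypothetical optic disc at (800, 500) pixels with a diameter of 150 pixels.
centre, radius = macula_candidate((800, 500), 150, temporal_sign=-1)
print(centre, radius)
```

The search for the macula would then be restricted to pixels inside this circle.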
Bright lesions detection
In this process, all bright regions in the image are detected as candidate regions. Since the optic disc has been identified in the previous stage, it is excluded from the candidate list. For the remaining bright regions, the means and standard deviations of different colour channels (RGB, HSI and LUV) are extracted as features for classifying them as bright lesions or non-lesions through a combination of multiple classifiers [10].
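The feature extraction for one candidate region can be sketched as below. For brevity the example uses the RGB channels only; the conversions to the other colour spaces and the combination of classifiers are omitted, and the function name and pixel values are hypothetical.

```python
from statistics import mean, pstdev


def colour_features(pixels):
    """Mean and standard deviation of each channel for a list of (R, G, B) pixels."""
    features = []
    for channel in range(3):
        values = [p[channel] for p in pixels]
        features.extend([mean(values), pstdev(values)])
    return features


# Hypothetical pixels belonging to one bright candidate region.
candidate = [(200, 180, 100), (210, 190, 110), (220, 200, 120)]
features = colour_features(candidate)
print(features)
```

Each candidate region yields one such feature vector, which a classifier could then label as lesion or non-lesion.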
Dot haemorrhages and microaneurysms detection
Since dot haemorrhages and microaneurysms (MA) are circular in shape, their cross-section profiles play an important role in separating them effectively from other objects. Candidate objects are first extracted and their cross-section profiles along multiple directions are then processed. A set of statistical features of these profiles is extracted and refined to separate dot haemorrhages and microaneurysms from similar candidate objects such as background noise, other common interfering objects and artefacts. The process is illustrated in Figure 5 [10].
Figure 5: A) An MA; B) the cross-section lines; C) SSA-based cross-section profiles. Candidate MAs and haemorrhages are first extracted together with their cross-section profiles; statistical features of these profiles are then extracted and refined to separate dot haemorrhages and microaneurysms from background noise and artefacts.
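The idea of comparing cross-section profiles along several directions can be illustrated with the simplified sketch below. It samples raw intensity profiles with nearest-pixel lookup and uses a single hypothetical statistic (the depth of the central dip); the SSA-based profile processing and the actual feature set of the system are not reproduced here.

```python
import math


def cross_section_profiles(image, centre, half_length=3, n_directions=4):
    """Sample intensity profiles through `centre` (row, col) along evenly spaced directions."""
    rows, cols = len(image), len(image[0])
    cy, cx = centre
    profiles = []
    for k in range(n_directions):
        theta = math.pi * k / n_directions
        profile = []
        for t in range(-half_length, half_length + 1):
            y = min(max(round(cy + t * math.sin(theta)), 0), rows - 1)
            x = min(max(round(cx + t * math.cos(theta)), 0), cols - 1)
            profile.append(image[y][x])
        profiles.append(profile)
    return profiles


def dip_depths(profiles):
    """Depth of the central dip in each profile: mean of the two ends minus the centre."""
    mid = len(profiles[0]) // 2
    return [(p[0] + p[-1]) / 2 - p[mid] for p in profiles]


# Hypothetical 7x7 patches: a dark dot versus a dark vertical line (vessel-like).
dot = [[100] * 7 for _ in range(7)]
dot[3][3] = 40
line = [[40 if col == 3 else 100 for col in range(7)] for _ in range(7)]

dot_depths = dip_depths(cross_section_profiles(dot, (3, 3)))
line_depths = dip_depths(cross_section_profiles(line, (3, 3)))
print(dot_depths, line_depths)
```

In this toy case the dot shows the same dip in every direction, whereas the line shows no dip along its own direction, which is the kind of difference that direction-wise profile statistics can capture.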
Haemorrhages detection
While dot haemorrhages are detected together with MAs, other types of haemorrhage (such as elongated ones) still need to be identified. The complexity of haemorrhage detection lies in the fact that the shape and size of haemorrhages can vary hugely. Once the retinal blood vessels have been detected (see above) as large dark objects, the majority of blood vessels are first removed and the remaining dark objects are classified as haemorrhages or non-haemorrhages. Moreover, some elongated haemorrhages can be connected to vessels, as shown in Figure 6. These haemorrhages are separated from the vessel during the vessel network tracking process [10].
Comparison between human and automated grading
Once grading had been completed independently by both the human graders and DAPHNE, the outcomes were compared for all 3 studies.
DAPHNE outputs a value of 0 for no detectable DR changes and 1 for detectable DR; an image recognized as ungradable by the software is additionally reported with an image quality output of 0. Such images were afterwards converted to a value of 1 (potentially detectable retinopathy) to signify the need for grading by a human grader. The human grading results were converted to the same binary scale per patient: every grade from 1 to 4 on the grading scale became 1 (detectable DR) and grade 0 became 0 (no DR changes). Patients with ungradable images were also marked as positive (value 1) in the human grading, for fair comparison with DAPHNE’s output, so if both human and software deemed an image ungradable it was classified as a true positive. If the software deemed an image ungradable and the human grader found it gradable with no DR changes, this was considered a false positive.
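The conversion and comparison rules described above can be summarised in a short sketch. The function names and the way the DAPHNE output is represented (a DR flag plus a gradability flag) are assumptions for illustration; only the mapping rules themselves come from the text.

```python
def binarise_human(grade):
    """Map a human grade (0-4 on the ICDR scale, or 'ungradable') to 0 or 1."""
    if grade == "ungradable":
        return 1  # ungradable patients are treated as positive
    return 1 if grade >= 1 else 0


def binarise_daphne(dr_detected, gradable):
    """Map the software result to 0 or 1; ungradable images count as positive."""
    return 1 if (dr_detected or not gradable) else 0


def compare(human, daphne):
    """Classify one patient as TP, FP, FN or TN with human grading as reference."""
    if daphne == 1:
        return "TP" if human == 1 else "FP"
    return "FN" if human == 1 else "TN"


# Both human and software deem the images ungradable: true positive.
print(compare(binarise_human("ungradable"), binarise_daphne(False, gradable=False)))
# Software ungradable, human gradable with no DR: false positive.
print(compare(binarise_human(0), binarise_daphne(False, gradable=False)))
```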
This study concentrated on DAPHNE’s ability to detect DR on retinal images when compared to human graders. Therefore comparison with HbA1c, DM type or other co-morbidities and DR grading was outside of the scope of this study.
Ethical approvals
All images were anonymized before submitting to MEHRC for grading. All studies had relevant ethical approvals and all studies followed the Helsinki declaration.
HAPIEE study (Lithuania)
In this study 1014 participants were included. Human grading from MEHRC detected 94 referable DR cases giving a prevalence of 9.3%. A total of 91 participants showed NPDR changes and 3 had PDR. Humans graded 46 participants to have at least one eye ungradable due to insufficient image quality or obscuring objects.
Table 1 shows the results from the DAPHNE software. It had a total of 165 positive cases, 16.3% of the study population. The sensitivity (SE) was 93.6% (95% CI, 88.1-97.0%) and the specificity (SP) was 84.6% (95% CI, 82.0-86.9%). The positive predictive value (PPV) was 49.3% (95% CI, 43.1-55.4%) and the negative predictive value (NPV) was 98.8% (95% CI, 97.7-99.5%). Overall agreement between software and human grading was 85.8%. The false positive rate of the DAPHNE positive results was 48.5%.
DAPHNE software (rows) / Reading Centre, MEH (columns) | Positive | Negative | Ungradable | Total
Positive | 85 | 80 | 0 | 165
Negative | 9 | 739 | 0 | 748
Ungradable | 0 | 55 | 46 | 101
Total | 94 | 874 | 46 | 1014
Table 1: Results from the HAPIEE study for human grading and the DAPHNE software. The outcome is divided into positive, negative and ungradable and compared between human grading and DAPHNE.
DAPHNE declared 6.4% of all the fundus images to have insufficient quality for automated grading.
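As a worked check, the sketch below recomputes the accuracy measures from the counts in Table 1, under the assumption (following the comparison rules in the Methods) that an ungradable result counts as positive for both the software and the human grader. With this reading the sensitivity, specificity, NPV and overall agreement match the reported values; the PPV comes out at 49.2%, one rounding step from the reported 49.3%. The variable names are illustrative only.

```python
# Counts from Table 1: outer key = DAPHNE result, inner key = human grading result.
table1 = {
    "positive": {"positive": 85, "negative": 80, "ungradable": 0},
    "negative": {"positive": 9, "negative": 739, "ungradable": 0},
    "ungradable": {"positive": 0, "negative": 55, "ungradable": 46},
}

tp = fp = fn = tn = 0
for daphne_label, row in table1.items():
    for human_label, n in row.items():
        daphne_pos = daphne_label != "negative"  # ungradable treated as positive
        human_pos = human_label != "negative"
        if daphne_pos and human_pos:
            tp += n
        elif daphne_pos:
            fp += n
        elif human_pos:
            fn += n
        else:
            tn += n

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)
npv = tn / (tn + fn)
agreement = (tp + tn) / (tp + fp + fn + tn)

for name, value in [("SE", sensitivity), ("SP", specificity), ("PPV", ppv),
                    ("NPV", npv), ("Agreement", agreement)]:
    print(f"{name}: {100 * value:.1f}%")
```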
PAMDI study (Italy)
In the PAMDI study a total of 882 participants were included. Human grading detected 68 referable DR cases, giving a prevalence of 7.7%. A total of 67 participants showed NPDR changes and 1 had PDR. Humans graded 11 participants to have at least one eye ungradable due to insufficient image quality or obscuring objects.
Table 2 shows the results from the DAPHNE software. It detected a total of 201 positive cases, 22.8% of the study population. The SE was 97.5% (95% CI, 91.2-99.7%) and the SP was 81.4% (95% CI, 78.6.5-84.1%). The PPV was 34.1% (95% CI, 27.9-40.6%) and the NPV was 99.7% (95% CI, 98.9-100%). Overall agreement between software and human grading was 82.9%. The false positive rate of the DAPHNE positive results was 66.7%.
DAPHNE software (rows) / Reading Centre, MEH (columns) | Positive | Negative | Ungradable | Total
Positive | 67 | 134 | 0 | 201
Negative | 1 | 654 | 1 | 656
Ungradable | 0 | 15 | 10 | 25
Total | 68 | 803 | 11 | 882
Table 2: Results from the PAMDI study for human grading and the DAPHNE software. The outcome is divided into positive, negative and ungradable and compared between human grading and DAPHNE.
In the PAMDI study, DAPHNE found 4.2% of all the fundus images to have insufficient quality for automated grading.
MARS study (Germany)
In the MARS study a total of 909 participants were included. Human graders detected 60 referable DR cases, giving a prevalence of 6.6%. A total of 57 participants showed NPDR changes and 3 participants had PDR in at least one eye. Humans graded 3 participants to have at least one eye ungradable due to insufficient image quality or obscuring objects.
Table 3 shows the results from the DAPHNE software. It had a total of 217 positive cases, 23.9% of the study population. The SE was 98.4% (95% CI, 91.5-100%) and the SP was 81.6% (95% CI, 78.9-81.2%). The PPV was 28.6% (95% CI, 22.7-35.1%) and the NPV was 99.9% (95% CI, 99.2-100%). Overall agreement between software and human grading was 82.5%. The false positive rate of the DAPHNE positive results was 71.2%.
DAPHNE software (rows) / Reading Centre, MEH (columns) | Positive | Negative | Ungradable | Total
Positive | 59 | 153 | 3 | 215
Negative | 1 | 689 | 0 | 690
Ungradable | 0 | 2 | 0 | 2
Total | 60 | 844 | 3 | 907
Table 3: Results from the MARS study for human grading and the DAPHNE software. The outcome is divided into positive, negative and ungradable and compared between human grading and DAPHNE.
In the MARS study the images were taken under clinical trial conditions, which resulted in only 0.3% of all fundus images having insufficient quality for automated grading; this is the same proportion as was declared ungradable by human graders.
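The false positive rates quoted for the DAPHNE positive results can be recomputed from the "Positive" rows of Tables 1 to 3. The sketch below assumes that each rate is the share of DAPHNE-positive patients who were graded negative by the human graders, and that the total rate is obtained by pooling the three rows; this reading reproduces the per-study values and gives a pooled value of about 63%. The dictionary layout and key names are illustrative.

```python
# "Positive" rows of Tables 1-3: human-negative count and row total per study.
daphne_positive_rows = {
    "HAPIEE": {"human_negative": 80, "row_total": 165},
    "PAMDI": {"human_negative": 134, "row_total": 201},
    "MARS": {"human_negative": 153, "row_total": 215},
}

per_study = {
    name: round(100 * row["human_negative"] / row["row_total"], 1)
    for name, row in daphne_positive_rows.items()
}

pooled_negative = sum(row["human_negative"] for row in daphne_positive_rows.values())
pooled_total = sum(row["row_total"] for row in daphne_positive_rows.values())
pooled = round(100 * pooled_negative / pooled_total, 1)

print(per_study)
print(pooled)
```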
ROC for the three populations
The results from each of the three populations were used to produce a receiver operating characteristic (ROC) curve. These curves are combined in Figure 7. The areas under the curve (AUC) for each study were as follows: Lithuania (HAPIEE study) 0.9689, Italy (PAMDI study) 0.9731 and Germany (MARS study) 0.9730.
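For readers less familiar with the AUC, the sketch below shows one standard way of computing it: as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. How the per-patient scores underlying Figure 7 were derived is not described here, so the scores and labels below are entirely hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical software scores and human reference labels (1 = DR, 0 = no DR).
scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
print(auc)
```

An AUC of 1.0 would mean that every positive case scores higher than every negative case, while 0.5 corresponds to chance-level ranking.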
This is the first time the DAPHNE software has been tested on three different populations concurrently, in order to test its ability to function on image sets from three different countries with different camera types and settings. It detected DR changes with SE above 93% in all three studies and with an excellent NPV of above 98% in all studies, without missing sight-threatening DR in any of them. With these high SE results it stayed above the 80% SE required by UK guidelines for a screening setting [18], albeit in population-based studies rather than in DR screening, but for the same disease. There was only a small difference in SE between the different populations, ranging from 93.6% to 98.4%, and an even smaller range for NPV, from 98.8% to 99.9%. In all three studies the overall agreement between the DAPHNE software and human grading was above 82%, with an impressive AUC for each of the three populations.
Image quality is clearly important for both human and automated grading, as the study with the lowest SE and NPV was the HAPIEE study, which had the highest number of patients ungradable due to poor image quality.
The results show an overestimation of diseased retinae, with a higher than expected level of false positive results and a total rate of 63%. One reason for this is the number of images judged ungradable by the software. If any image of a patient is ungradable, the whole patient becomes a positive outcome, as a safeguard against overlooking disease in images of insufficient quality. This is a safety feature of the current software, but admittedly it raises the number of images to be re-graded by human graders. In future this element should be improved so that the automated software mimics the human graders’ workflow and treats an image as gradable when there is enough gradable area per eye. Another major reason is that the software still needs to learn to identify lesions related to other diseases that mimic DR, such as drusen and pigmentary changes in age-related macular degeneration (AMD), where drusen can be mistaken for exudates and pigmentary changes might be interpreted as haemorrhages. Currently, the artefact detection system works well and filters out repeated artefacts, so these cause less of an issue. Based on our results, DAPHNE might be safe to use in population-based studies to minimise grading requirements for DR, as it has a good safety profile in being able to identify completely normal images. Had these been screening images, no referable cases would have been missed [19].
Nevertheless, the software would have resulted in a 70% decrease in the number of patients needing to be graded by humans, assuming that only DAPHNE-positive results were seen again by humans. Even if a small percentage of normals were to be seen by the software, DAPHNE would have performed as well as other software, which achieved a 60% reduction in human grading input [5]. In terms of reduction of manual human workload, DAPHNE was superior to other software for which reductions of 36.3% [20] and 48.4% [21], respectively, were reported.
The sensitivity and NPV of DAPHNE are similar to those of other automated software that has been tested on a large population, with a SE of 91% and an NPV of 98% [5], although that software has a higher cut-off and is designed to find disease that is likely to become sight-threatening and require referral. A higher cut-off value might suit the requirements of some countries better, while for others detecting any disease might be more meaningful, depending on the capacity for shorter follow-ups and treatment. A particular strength of DAPHNE is its high specificity, which was above 80% in all three populations. This is higher than that of other software. Valverde et al. compared the available automated software and their results on screening for DR [22]. They reported a SE of 97% and a SP of 75% for the Retinalyze System® [7]. The iGradingM®, which works similarly to DAPHNE with a “disease/no-disease” output, achieved a SE of 90.5% and a SP of 67.4% [23]. The RetmarkerSR® achieved a SE of 95.8% and a SP of 63.2% [24]. These other software packages have approximately the same SE as DAPHNE, but DAPHNE showed a higher SP than all of them.
For population-based studies, such as the three used in this study, detecting any DR-related change is essential for establishing true estimates of the burden of disease. Hence, it is fair to say that DAPHNE is showing promising results for determining the disease/no-disease state, but further work is required to allow more sophisticated sub-categorisation of the image sets.
Automated image analysis is not the only avenue researchers pursue for better population coverage and more refined phenotyping of patient cohorts. Some groups have worked with a combination of automated software and analysis of tear fluid proteomics biomarkers for detecting DR, with a sensitivity/specificity of 0.93/0.78 [25,26]. These results are similar to ours; although there are still methodological issues, this might be a promising approach in cases where there is no possibility of taking images.
For the DAPHNE software to function in a real-life setting, either for population-based studies or for DR screening, it still needs to be able to identify other common eye diseases such as age-related macular degeneration (AMD) and glaucoma. This could be the next stage in the development of fully functional retinal image analysis software. The DAPHNE team is working on implementing optic disc grading tools in the software, and further testing will be carried out on this methodology to prove its worth [27]. The DAPHNE software also needs testing on other camera modalities such as ultra-wide-field imaging, and not just on standard fundus cameras.
In conclusion, the DAPHNE software showed consistently good results in analysing DR in three different countries in Europe. Further head-to-head comparison with other software, and possibly the development of a combined algorithm taking advantage of the best elements of all software packages, might produce the safest and best quality final DR software.
The authors of this paper thank the participants and staff of the HAPIEE, PAMDI and MARS Study, for permission to use the images for this current study. MBH is funded from University of Southern Denmark, Synoptik-Fonden and Familien Hede Nielsens Fond. TP is funded from the NIHR BMRC at Moorfields Eye Hospital NHS Foundation Trust and UCL Institute of Ophthalmology. This project was supported by the NSTIP strategic technologies program in the Kingdom of Saudi Arabia (Project No.: 10-INF1262-03). The authors also acknowledge with thanks the Science and Technology Unit, King Abdulaziz University for technical support. The authors would like to thank the Engineering and Physical Sciences Research Council (EPSRC) in the UK for supporting the foundation of this work.