Breast cancer risk estimation with artificial neural networks revisited
Discrimination and calibration
Turgay Ayer MS,
- Industrial and Systems Engineering Department, University of Wisconsin, Madison, Wisconsin
- Department of Radiology, University of Wisconsin, Madison, Wisconsin
Oguzhan Alagoz PhD,
- Industrial and Systems Engineering Department, University of Wisconsin, Madison, Wisconsin
Jagpreet Chhatwal PhD,
- Health Economic Statistics, Merck Research Laboratories, North Wales, Pennsylvania
Jude W. Shavlik PhD,
- Department of Computer Science, University of Wisconsin, Madison, Wisconsin
Charles E. Kahn Jr MD, MS,
- Department of Radiology, Medical College of Wisconsin, Milwaukee, Wisconsin
Elizabeth S. Burnside MD, MPH, MS (corresponding author)
- Industrial and Systems Engineering Department, University of Wisconsin, Madison, Wisconsin
- Department of Radiology, University of Wisconsin, Madison, Wisconsin
- Department of Biostatistics and Medical Informatics, University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin
- Department of Radiology, University of Wisconsin Medical School, E3 of 311, 600 Highland Avenue, Madison, WI 53792-3252
Discriminating malignant breast lesions from benign ones and accurately predicting the risk of breast cancer for individual patients are crucial to successful clinical decisions. In the past, several artificial neural network (ANN) models have been developed for breast cancer-risk prediction. All studies have reported discrimination performance, but not one has assessed calibration, which is an equally important measure for accurate risk prediction. In this study, the authors evaluated whether an ANN trained on a large prospectively collected dataset of consecutive mammography findings can discriminate between benign and malignant disease and accurately predict the probability of breast cancer for individual patients.
The dataset consisted of 62,219 consecutively collected mammography findings matched with the Wisconsin State Cancer Reporting System. The authors built a 3-layer feedforward ANN with 1000 hidden-layer nodes. The authors trained and tested their ANN by using 10-fold cross-validation to predict the risk of breast cancer. The authors used the area under the receiver-operating characteristic curve (AUC), sensitivity, and specificity to evaluate discriminative performance of the radiologists and their ANN. The authors assessed the accuracy of risk prediction (ie, calibration) of their ANN by using the Hosmer-Lemeshow (H-L) goodness-of-fit test.
The authors' ANN demonstrated superior discrimination (AUC, 0.965) compared with the radiologists (AUC, 0.939; P < .001). The authors' ANN was also well calibrated, as shown by an H-L goodness-of-fit P value of .13.
The authors' ANN can effectively discriminate malignant abnormalities from benign ones and accurately predict the risk of breast cancer for individual abnormalities. Cancer 2010. © 2010 American Cancer Society.
Successful breast cancer diagnosis requires systematic image analysis, characterization, and integration of many clinical and mammographic variables.1 An ideal diagnostic system would discriminate between benign and malignant findings perfectly. Unfortunately, perfect discrimination has not been achieved, so radiologists must make decisions based on their best judgment of breast cancer risk amid substantial uncertainty. When there are numerous interacting predictive variables, ad hoc decision strategies based on experience and memory may lead to errors2 and variability in practice.3, 4 That is why there is intense interest in developing tools that can calculate an accurate probability of breast cancer to aid in making decisions.5-7
Discrimination and calibration are the 2 main components of accuracy in a risk-assessment model.8, 9 Discrimination is the ability to distinguish benign abnormalities from malignant ones. Although assessing discrimination with the area under the receiver-operating characteristic (ROC) curve (AUC) is a popular method in the medical community, it may not be optimal in assessing risk prediction models that stratify individuals into risk categories.10 In this setting, calibration is also an important tool for accurate risk assessment of individual patients. Calibration measures how well the probabilities generated by the risk prediction model agree with the observed probabilities in the actual population of interest.11 There is a trade-off between discrimination and calibration, and a model typically cannot be perfect in both.10 In general, risk-prediction models need good discrimination when their aim is to separate malignant findings from benign ones, and good calibration when their aim is to stratify individuals into higher or lower risk categories to aid in decisions and communication.11
Computer models have the potential to help radiologists increase the accuracy of mammography examinations in both detection12-15 and diagnosis.16-20 Existing computer models in the domain of breast-cancer diagnosis can be classified under 3 broad categories: prognostic, computer-aided detection (CAD), and computer-aided diagnostic (CADx) models. Prognostic models, such as the Gail model,21-24 use retrospective risk factors such as a woman's age, her personal and family histories of breast cancer, and clinical information to predict breast cancer risk during a time interval in the future for treatment or risk-reduction decisions.24 These models provide guidance for clinical trial eligibility, tailored disease surveillance, and chemoprevention strategies.25 Because risk stratification is of primary interest in prognostic models, the performance of these models is assessed principally by calibration measures.11 Detection or CAD models12-15, 26-28 are developed to assist radiologists in identifying possible abnormalities in radiologic images, leaving the interpretation of the abnormality to the radiologist.29 Because discrimination is most important, and calibration is less critical in detection, the performance of CAD models is typically evaluated in terms of ROC curves.11 Diagnostic or CADx models30-39 characterize findings from mammograms (eg, size, contrast, shape) identified either by a radiologist or a CAD model29 to help radiologists classify lesions as benign or malignant by providing objective information, such as the risk of breast cancer.40 CADx models are similar to prognostic models in 1 way: they estimate the risk of breast malignancy to help physicians and patients improve decisions.29 On the other hand, CADx models differ from prognostic models in that their risk estimation is based on mammography findings at a single time point (ie, at the time of mammography) to aid in further imaging or intervention decisions.
Both discrimination and calibration are important features of a CADx model. High discrimination is needed because helping radiologists to distinguish malignant findings from benign ones is the primary purpose of CADx models.11 In addition, good calibration is needed to stratify risk and communicate the risk with patients as in the example of prognostic models.11
However, existing CADx studies that use ANNs to assess the risk of breast cancer have ignored calibration and focused only on discrimination ability.31, 36, 38, 39 Most of these studies have good discrimination but may be very poorly calibrated.41 For example, 4 such models report that no cancers would be missed if the threshold to defer biopsy was set to 10%-20%.31, 35, 37, 42 By suggesting a threshold in this range to defer biopsy, these models not only substantially exceed the accepted biopsy threshold in clinical practice of 2%,43 but they also indicate a systematic overestimation of malignancy risk. This discrepancy is likely attributable to suboptimal calibration.
In addition, existing studies have several potential limitations that make them impractical for clinical implementation. First, the training datasets used for building ANNs in these previous studies have been too small (104-1288 lesions)31, 35, 36, 38, 39 to yield reliable models. Second, the majority of these studies developed models by using only findings that underwent biopsy,30, 31, 35-37, 39 or were referred to a surgeon,38 and excluded other findings from their analysis, which may lead to biased models.
Our research team has developed 2 CADx models that use the same dataset to discriminate malignant mammography findings from benign ones.33, 34 This study differs from our previous research in 2 ways. First, this study uses a different modeling technique (an ANN) than our previous research, which used logistic regression and a Bayesian network. Second, this study considers calibration, whereas our previous research, like many other CADx models, evaluated only discrimination.
The purpose of our study is to evaluate whether an ANN trained on a large prospectively collected dataset of consecutive mammography findings can discriminate between benign and malignant disease and accurately predict the probability of breast cancer for individual patients.
MATERIALS AND METHODS
The institutional review board exempted this Health Insurance Portability and Accountability Act (HIPAA)-compliant, retrospective study from requiring informed consent. The data used in this study have been presented in our previous studies33, 34 and are described again here for the convenience of the reader.
All of the screening and diagnostic mammograms performed at the Froedtert and Medical College of Wisconsin Breast Care Center between April 5, 1999 and February 9, 2004 were included in our dataset for retrospective evaluation. We consolidated our database in the National Mammography Database (NMD) format, a data format based on the standardized Breast Imaging Reporting and Data System (BI-RADS) lexicon developed by the American College of Radiology (ACR) for standardized monitoring and tracking of patients.44, 45 The study comprised 48,744 mammograms belonging to 18,269 patients (Table 1).
|Characteristic||Malignant, No. (%)||Benign, No. (%)||Total, No. (%)|
|No. of mammograms||477 (1)||48,267 (99)||48,744 (100)|
|Age groups, y|
|<45||66 (13.84)||9529 (19.74)||9595 (19.68)|
|45-49||49 (10.27)||7524 (15.59)||7573 (15.54)|
|50-54||56 (11.74)||7335 (15.2)||7391 (15.16)|
|55-59||71 (14.88)||6016 (12.46)||6087 (12.49)|
|60-64||59 (12.37)||4779 (9.9)||4838 (9.93)|
|≥65||176 (36.9)||13,084 (27.11)||13,260 (27.20)|
|Breast density|
|Predominantly fatty||61 (12.79)||7226 (14.97)||7287 (14.95)|
|Scattered fibroglandular||201 (42.14)||19,624 (40.66)||19,825 (40.67)|
|Heterogeneously dense||174 (36.48)||17,032 (35.29)||17,206 (35.30)|
|Extremely dense tissue||41 (8.6)||4385 (9.08)||4426 (9.08)|
|BI-RADS category|
|1||0 (0)||21,094 (43.7)||21,094 (43.28)|
|2||13 (2.73)||10,048 (20.82)||10,061 (20.64)|
|3||32 (6.71)||8520 (17.65)||8552 (17.54)|
|0||130 (27.25)||8148 (16.88)||8278 (16.98)|
|4||137 (28.72)||364 (0.75)||501 (1.03)|
|5||165 (34.59)||93 (0.19)||258 (0.53)|
Each mammogram was prospectively interpreted by 1 of 8 radiologists. Four of these radiologists were general radiologists, 2 were fellowship trained in breast imaging, and the other 2 had extensive experience in breast imaging. These radiologists had between 1 and 35 years of experience interpreting mammography. Each radiologist reviewed 6994 mammograms on average (median, 2924; range, 49-22,219) in our dataset.
Each mammographic finding, if any, was recorded as a unique entry in our database. In the case of a negative mammogram, a single entry showing only demographic data (age, personal history, prior surgery, and hormone replacement therapy) and BI-RADS assessment category was entered. If an image had more than 1 reported finding with only 1 of them being cancer, we considered the other findings as false positives. Throughout the current article, the term “finding” will be used to denote the single record for normal mammograms or each record denoting an abnormality on a mammogram. Both radiologists (for mammography findings) and technologists (for demographic data) used the PenRad (Minnetonka, Minn) mammography reporting/tracking data system, which records clinical data in a structured format (ie, point-and-click entry of information populates the clinical report and the database simultaneously). We included in our ANN model all of the demographic risk factors and BI-RADS descriptors that were routinely collected in the practice and predictive of breast cancer (Table 2). We obtained the reading radiologist's information by merging the PenRad data with the radiology information system at the Medical College of Wisconsin. We could not assign 504 findings to a radiologist during our matching protocol. We elected to keep these unassigned findings in our dataset to maintain its consecutive nature.
|Variable||Categories|
|Age groups, y||<45, 45-50, 51-54, 55-60, 61-64, ≥65|
|Hormone therapy||None, <5 y, >5 y|
|Personal history of BCA||No, yes|
|Family history of BCA||None, minor (nonfirst-degree family members), major (1 or more first-degree family members)|
|Breast density||Predominantly fatty, scattered fibroglandular, heterogeneously dense, extremely dense|
|Mass margins||Circumscribed, ill-defined, microlobulated, spiculated, not present|
|Mass stability||Decreasing, stable, increasing, not present|
|Mass shape||Oval, round, lobular, irregular, not present|
|Mass density||Fat, low, equal, high, not present|
|Mass size||None, small (<3 cm), large (≥3 cm)|
|Lymph node||Present, not present|
|Asymmetric density||Present, not present|
|Skin thickening||Present, not present|
|Tubular density||Present, not present|
|Skin retraction||Present, not present|
|Nipple retraction||Present, not present|
|Skin thickening||Present, not present|
|Trabecular thickening||Present, not present|
|Skin lesion||Present, not present|
|Axillary adenopathy||Present, not present|
|Architectural distortion||Present, not present|
|Prior history of surgery||No, yes|
|Postoperative change||No, yes|
|Popcorn||Present, not present|
|Milk||Present, not present|
|Rodlike||Present, not present|
|Eggshell||Present, not present|
|Dystrophic||Present, not present|
|Lucent||Present, not present|
|Dermal||Present, not present|
|Round||Scattered, regional, clustered, segmental, linear ductal|
|Punctate||Scattered, regional, clustered, segmental, linear ductal|
|Amorphous||Scattered, regional, clustered, segmental, linear ductal|
|Pleomorphic||Scattered, regional, clustered, segmental, linear ductal|
|Fine Linear||Scattered, regional, clustered, segmental, linear ductal|
|BI-RADS category||0, 1, 2, 3, 4, 5|
We analyzed discrimination and calibration accuracy at the finding level because this is the level at which recall and biopsy decisions are made in clinical practice. We believe this is the level at which computer-assisted models will help radiologists improve performance. However, because conventional analysis of mammographic data is at the mammogram level (where findings from a single study are combined), we also calculated the cancer detection rate, the early stage cancer detection rate, and the abnormal interpretation rate at the mammogram level for comparison. We specify whether analyses in this study are based on mammograms or findings.
Data obtained from the Wisconsin Cancer Reporting System (WCRS), a statewide cancer registry, were used as our reference standard. The WCRS has been collecting information from hospitals, clinics, and physicians since 1978. The WCRS records demographic information, tumor characteristics (eg, date of diagnosis, primary site, stage of disease), and treatment information for all newly diagnosed breast cancers in the state. Under data exchange agreements, out-of-state cancer registries also provide reports on Wisconsin residents diagnosed in their states. Findings that had matching registry reports of ductal carcinoma in situ or any invasive carcinoma within 12 months of a mammogram date were considered positive. Findings shown to be benign by biopsy or without a registry match within the same time period were considered negative.
We built a 3-layer, feed-forward neural network by using Matlab 7.4 (The MathWorks, Natick, Mass) with a backpropagation learning algorithm46 to estimate the likelihood of malignancy. The layers included an input layer of 36 discrete variables (mammographic descriptors, demographic factors, and BI-RADS final assessment categories as entered by the radiologists; Table 2), a hidden layer with 1000 hidden nodes, and an output layer with a single node generating the probability of malignancy for each finding. We designed our ANN to have a large number of hidden nodes because ANNs with a large number of hidden nodes generalize better than networks with a small number of hidden nodes when trained with backpropagation and “early stopping”.47-49 (See Discussion.)
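The architecture described above (36 inputs, 1000 hidden nodes, 1 output node) can be illustrated with a minimal Python sketch of a forward pass. This is not the authors' Matlab implementation: the tanh hidden activation, the sigmoid output, the random weight initialization, and the names init_network and predict are all assumptions made for illustration, and no training is shown.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function mapping a real number to (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def init_network(n_inputs=36, n_hidden=1000, seed=0):
    """Create random weights for one hidden layer and a single output node.

    The uniform initialization is an assumption; the source does not
    state how the weights were initialized.
    """
    rng = random.Random(seed)
    s_in = 1.0 / math.sqrt(n_inputs)
    s_out = 1.0 / math.sqrt(n_hidden)
    w_hidden = [[rng.uniform(-s_in, s_in) for _ in range(n_inputs)]
                for _ in range(n_hidden)]
    b_hidden = [0.0] * n_hidden
    w_out = [rng.uniform(-s_out, s_out) for _ in range(n_hidden)]
    b_out = 0.0
    return w_hidden, b_hidden, w_out, b_out


def predict(network, x):
    """Forward pass: inputs -> hidden layer (tanh, assumed) -> sigmoid output."""
    w_hidden, b_hidden, w_out, b_out = network
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


network = init_network()
finding = [0.0] * 36  # hypothetical encoded finding, all descriptors "absent"
print(predict(network, finding))
```

The single output is read as the estimated probability of malignancy for one finding; backpropagation would adjust the weights so that this output tracks the registry outcome.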
To train and test our ANN, we used a standard machine-learning method called 10-fold cross-validation, which ensures that a test sample is never used for training. In our 10-fold cross-validation, the data was divided into 10 subsets that were approximately equal in size. In the first iteration, 9 of these subsets were combined and used for training. The remaining 10th set was used for testing the performance of our ANN on unseen cases. We repeated this process for 10 iterations until all subsets were used once for testing. In addition to 10-fold cross-validation, to assess the robustness of our ANN, we performed the following supplementary analyses: 1) we trained our ANN on the first half of the dataset and tested on the second half, 2) we trained our ANN on the second half of the dataset and tested on the first half.
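The fold construction described above can be sketched as follows. The shuffling step and the function name k_fold_indices are assumptions; the source does not say whether folds were randomized, stratified by outcome, or grouped by patient.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists so each item is tested exactly once.

    Indices are shuffled (an assumption) and dealt into k folds whose
    sizes differ by at most one.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_indices(62_219, k=10):
    # A model would be fitted on `train` and scored on `test` here.
    assert not set(train) & set(test)
```

Pooling the test-fold predictions from all 10 iterations gives one out-of-sample probability per finding, which is what the ROC and calibration analyses are then computed on.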
We used an “early stopping” (ES) procedure to prevent our ANN from overfitting and to keep it generalizable to future cases.50, 51 Generalizability is the ability of a model to demonstrate similar predictive performance on data not used for training but consisting of unseen cases from the same population. A model lacks generalizability when overfitting occurs, a phenomenon whereby the model “memorizes” the cases in the training data but fails to generalize to new data. When overfitting occurs, ANNs obtain spuriously good performance by learning anomalous patterns unique to the training set but generate high error, and therefore low accuracy, when presented with unseen data.52 We performed ES by using a validation (tuning) set, in addition to a training and a testing set, to calculate the network error during training and to stop training early if necessary to prevent overfitting.50-52
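A minimal sketch of the early-stopping logic follows. The patience rule (stop after a fixed number of epochs without improvement on the validation set) is one common stopping criterion and is an assumption here; the source does not give the exact rule used. The callables train_step and validation_error are hypothetical placeholders for one epoch of backpropagation and for the error on the validation (tuning) set.

```python
def train_with_early_stopping(train_step, validation_error,
                              max_epochs=1000, patience=5):
    """Train until validation error stops improving for `patience` epochs.

    Returns (best_epoch, best_error). A full implementation would also
    save the weights at the best epoch and restore them afterwards.
    """
    best_error = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_step(epoch)                 # one pass over the training set
        error = validation_error(epoch)   # error on held-out tuning set
        if error < best_error:
            best_error, best_epoch = error, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                     # validation error has stalled
    return best_epoch, best_error


# Simulated validation curve: improves, then rises as overfitting sets in.
simulated = [0.9, 0.7, 0.5, 0.4, 0.45, 0.5, 0.6, 0.7, 0.3]
print(train_with_early_stopping(lambda e: None, lambda e: simulated[e],
                                max_epochs=len(simulated), patience=3))
```

In the simulated curve, training halts before the late dip is reached, which shows the cost of a short patience window: the rule trades a possibly better later minimum for protection against overfitting.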
We evaluated the discriminative ability of our ANN against radiologists at an aggregate level and at an individual-radiologist level. We plotted the receiver-operating characteristic (ROC) curve for our ANN by using the probabilities generated for all findings by means of our 10-fold cross-validation technique. We constructed the ROC curves for all radiologists individually and in aggregate by using BI-RADS assessment categories assigned by the radiologists to each finding. We ordered BI-RADS assessment categories by the increasing likelihood of malignancy (1<2<3<0<4<5) for this purpose. We measured the area under the curve (AUC), sensitivity, and specificity to assess the discriminative ability of our ANN and the radiologists (in aggregate and individually). We used a 2-tailed DeLong method53 to measure and compare AUCs because it accounts for correlation between the ROC curves obtained from the same data.
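To make the radiologist ROC construction concrete, the sketch below maps BI-RADS categories to ranks in the stated order (1<2<3<0<4<5) and computes the nonparametric (Mann-Whitney) estimate of the AUC, counting ties as one half. The function and variable names are illustrative assumptions, the example data are invented, and the DeLong variance estimate and comparison test are not reproduced here.

```python
# Rank of each BI-RADS category in the ordering 1 < 2 < 3 < 0 < 4 < 5.
BIRADS_RANK = {1: 0, 2: 1, 3: 2, 0: 3, 4: 4, 5: 5}


def auc(scores, labels):
    """Probability that a random malignant case outranks a random benign one.

    `labels` uses 1 for malignant and 0 for benign. The pairwise loop is
    quadratic; a rank-based formula would be preferable for large datasets.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one malignant and one benign case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties are common with only 6 categories
    return wins / (len(pos) * len(neg))


# Invented toy data, not taken from the study.
birads = [1, 2, 0, 4, 5, 3]
cancer = [0, 0, 1, 0, 1, 0]
print(auc([BIRADS_RANK[b] for b in birads], cancer))
```

The same function applies unchanged to the ANN's continuous probabilities, which is why the two ROC curves can be compared on the same findings.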
We calculated sensitivity and specificity of our ANN and the radiologists at recommended levels of performance: sensitivity at a specificity of 90% and specificity at a sensitivity of 85%, as they represent the minimal performance thresholds for screening mammography.54 When calculating the sensitivity and specificity of the radiologists, we considered BI-RADS 0, 4, and 5 positive, whereas BI-RADS 1, 2, and 3 were designated negative.45 We used a 1-tailed McNemar test to compare sensitivity and specificity between the radiologists and our ANN.55 A McNemar test accounts for correlation between the sensitivity and specificity ratios and is not defined when the ratios are equal, nor when 1 of the ratios is 0 or 1. We used the Wilson method to generate confidence intervals for sensitivity and specificity.56 We considered P < .05 to be the level of statistical significance.
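The radiologist-side calculation can be sketched as below: BI-RADS 0, 4, and 5 count as positive calls, and a Wilson score interval is attached to a proportion. The function names and toy data are assumptions for illustration; the McNemar comparison and the ANN operating points at fixed specificity or sensitivity are not shown.

```python
import math

POSITIVE_BIRADS = {0, 4, 5}  # categories treated as a positive call in the text


def confusion_counts(birads, labels):
    """Return (tp, fp, tn, fn) for BI-RADS calls against cancer labels (1/0)."""
    tp = fp = tn = fn = 0
    for category, is_cancer in zip(birads, labels):
        called_positive = category in POSITIVE_BIRADS
        if is_cancer and called_positive:
            tp += 1
        elif is_cancer:
            fn += 1
        elif called_positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (z=1.96 gives 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Invented toy data, not taken from the study.
tp, fp, tn, fn = confusion_counts([0, 4, 5, 1, 2, 3, 1, 4],
                                  [1, 1, 0, 0, 0, 1, 0, 1])
print("sensitivity", tp / (tp + fn), wilson_interval(tp, tp + fn))
print("specificity", tn / (tn + fp), wilson_interval(tn, tn + fp))
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly for small counts, which matters when individual low-volume readers are analyzed.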
We assessed the calibration of our ANN by calculating the Hosmer-Lemeshow (H-L) goodness-of-fit statistic57 and plotting a calibration curve. The H-L statistic compares the observed and predicted risk within risk categories. A lower H-L statistic and a higher P value (P > .05) indicate better calibration. For the H-L statistic, the predicted risks of findings were rank-ordered and divided into 10 groups, based on their predicted probability. Within each predicted risk group, the number of predicted malignancies was accumulated against the number of observed malignancies. The H-L statistic was calculated from this 2 × 10 contingency table. The H-L statistic was then compared with the chi-square distribution, with degrees of freedom equal to 8. We also plotted a calibration curve to visually compare calibration of our ANN to the perfect calibration in predicting breast malignancy risk. In a calibration curve, a line at a 45° angle (line of identity) indicates perfect calibration. Data points to the right of the perfect calibration line represent overestimation of the risk, and those to the left of the line represent underestimation.58 Although a calibration curve does not provide a quantitative measure of reliability for probability predictions, it provides a graphical representation of the degree to which predicted probability of malignancy by our ANN corresponds to actual prevalence.58, 59 The calibration curve shows the ability of the model to enable prediction of probabilities across all ranges of risk.
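The H-L procedure described above can be sketched in a few lines. Findings are sorted by predicted risk, split into 10 equal-sized groups, and observed and expected counts of malignant and benign findings are compared in each group. The grouping by equal counts, the function names, and the closed-form chi-square tail (valid only for an even number of degrees of freedom, such as the 8 used here) are choices made for this illustration rather than details confirmed by the source.

```python
import math


def hosmer_lemeshow(probs, labels, groups=10):
    """H-L statistic from predicted probabilities and 0/1 outcomes."""
    # Stable sort on probability only, so tied predictions keep input order.
    pairs = sorted(zip(probs, labels), key=lambda pair: pair[0])
    n = len(pairs)
    statistic = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        expected = sum(p for p, _ in chunk)   # predicted malignancies
        observed = sum(y for _, y in chunk)   # observed malignancies
        if expected <= 0 or expected >= size:
            raise ValueError("group has a degenerate expected count")
        # Malignant cell plus benign cell of the 2 x groups table.
        statistic += (observed - expected) ** 2 / expected
        statistic += (observed - expected) ** 2 / (size - expected)
    return statistic


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; exact formula for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Invented toy data: predictions of 0.5 with exactly half the cases malignant.
stat = hosmer_lemeshow([0.5] * 100, [0, 1] * 50)
print(stat, chi2_sf_even_df(stat, df=8))
```

A P value above .05 means the test finds no significant disagreement between predicted and observed risk, which is how the text interprets good calibration.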
After matching to the cancer registry, our final matched dataset contained a total of 62,219 findings [510 (0.8%) malignant and 61,709 (99.2%) benign] in 18,269 patients (17,924 women and 345 men). The mean age of the female patients was 56.5 years (range, 17.7-99.1; SD, 12.7). Women were, on average, 2 years younger than men, whose mean age was 58.5 years (range, 18.6-88.5; SD, 15.7).
Our analysis at the mammogram level showed that 14% of the mammographic abnormalities occurred predominantly in fatty tissue, 41% in scattered fibroglandular tissue, 36% in heterogeneously dense tissue, and 9% in extremely dense tissue (Table 1). At the findings level, the cancers included 246 masses, 121 microcalcifications, 27 asymmetries, 18 architectural distortions, 86 combinations of findings, and 12 other.
Cancer registry match revealed a detection rate of 8.9 cancers per 1000 mammograms for the radiologists at the mammogram level (432 cancers in 48,744 mammograms; 33 patients had more than 1 cancer, resulting in 510 total cancers). The abnormal interpretation rate (considering BI-RADS 0, 4, and 5 abnormal) was 18.5% (9037 of 48,744 mammograms). Of the 432 cancers, 390 had staging information from the cancer registry, and 42 did not. Of the detected cancers with staging information, only 26.7% (104 of 390) had lymph node metastasis, and 71% (277 of 390) were early stage (ie, stage 0 or 1).
Following training and testing using 10-fold cross-validation, the AUC of our ANN, 0.965, was significantly higher than that of the radiologists in aggregate, 0.939 (P < .001), at the finding level, which implied that our ANN performed better than the radiologists alone in discriminating between benign and malignant findings. The ROC curve of our ANN (aggregate level) dominated the combined ROC curve of all radiologists at all cutoff thresholds (Fig. 1). This trend was preserved when the ANN was trained on the first half of the dataset and tested on the second half (ANN AUC, 0.949; radiologists AUC, 0.926; P < .001) or when trained on the second half of the dataset and tested on the first half (ANN AUC, 0.966; radiologists AUC, 0.951; P < .001). At the individual radiologists level, 4 of 8 comparisons were not statistically significant (Table 3). Of the 4 significant differences, our ANN outperformed the radiologists in all except a single, low-volume reader (Radiologist 8, Table 3).
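For readers less familiar with the AUC, it can be read as the probability that a randomly chosen malignant finding receives a higher predicted risk than a randomly chosen benign finding, with ties counted as one half. The sketch below computes it that way; the function name and the scores are hypothetical, and this is not necessarily how the AUCs above were computed or compared.

```python
def auc(malignant_scores, benign_scores):
    """AUC as the fraction of malignant/benign pairs ranked correctly (ties = 0.5)."""
    if not malignant_scores or not benign_scores:
        raise ValueError("both classes must contain at least one score")
    wins = 0.0
    for m in malignant_scores:
        for b in benign_scores:
            if m > b:
                wins += 1.0
            elif m == b:
                wins += 0.5
    return wins / (len(malignant_scores) * len(benign_scores))


# Hypothetical predicted risks: 3 of the 4 malignant/benign pairs are ordered correctly.
example = auc([0.9, 0.8], [0.1, 0.85])  # 0.75
```

The pairwise loop is quadratic in the number of findings, which is acceptable for illustration; a rank-based formulation would be preferable for a dataset of this size.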
At a specificity of 90%, the sensitivity of our ANN was significantly better (90.7% vs 82.2%; P < .001) than that of the radiologists (in aggregate; Table 4). Our ANN identified 44 more cancers when compared with the radiologists at this level of specificity (Table 5, part A). At a fixed sensitivity of 85%, the specificity of our ANN was also significantly better (94.5% vs 88.2%; P < .001) than that of the radiologists (in aggregate; Table 4). Our ANN decreased the number of false positives by 3941 when compared with the radiologists' performance at this level of sensitivity (Table 5, part B). In terms of specificity, all statistically significant comparisons revealed the ANN to be superior, with the exception of 1 low-volume reader (Radiologist 8 in Table 4). In terms of sensitivity, all statistically significant comparisons revealed the ANN to be superior; however, 1 low-volume reader demonstrated the opposite trend (Radiologist 1 in Table 4).
Table 4. Sensitivity and specificity of the radiologists and our ANN

| Radiologist | Benign | Malignant | Radiologist sensitivity, % (CI) | ANN sensitivity, % (CI) | P | Radiologist specificity, % (CI) | ANN specificity, % (CI) | P |
|---|---|---|---|---|---|---|---|---|
| 1 | 3312 | 77 | 93.5 (84.8, 97.6) | 88.4 (78.4, 94.1) | .0625 | 94.4 (93.6, 95.2) | 96.9 (96.4, 97.5) | <.001 |
| 3 | 18953 | 180 | 78.3 (71.4, 83.9) | 90.0 (84.5, 93.8) | <.001 | 85.0 (84.4, 85.5) | 95.0 (94.7, 95.3) | <.001 |
| 4 | 26690 | 171 | 82.4 (75.7, 87.6) | 93.0 (87.8, 96.1) | <.001 | 85.6 (85.1, 86.0) | 96.4 (96.1, 96.5) | <.001 |
| 6 | 6796 | 36 | 83.3 (66.5, 93.0) | 86.1 (69.7, 94.7) | .999 | 88.4 (87.6, 89.1) | 94.5 (93.9, 95.0) | <.001 |
| 7 | 3637 | 29 | 75.8 (56.0, 88.9) | 72.5 (52.5, 86.5) | .999 | 79.9 (78.6, 81.2) | 86.2 (85.0, 87.2) | <.001 |
| 8 | 1695 | 9 | 77.7 (40.1, 96.0) | 66.7 (30.9, 90.9) | .999 | 86.7 (85.0, 88.3) | 80.7 (78.7, 82.5) | <.001 |
| Unassigned | 497 | 7 | 100.0 (56.1, 100.0) | 100.0 (56.1, 100.0) | ND | 98.3 (96.7, 99.2) | 99.6 (98.4, 99.9) | .015 |
| Total | 61709 | 510 | 82.2 (78.5, 85.3) | 90.7 (87.8, 93.0) | <.001 | 88.2 (87.9, 88.5) | 94.5 (94.3, 94.6) | <.001 |
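Table 5. Performance of the radiologists and our ANN at fixed specificity and sensitivity

| A. Performance at 90% Specificity | True Positive | False Negative |
|---|---|---|
| Radiologists | 419 (400-435) | 91 (75-110) |
| ANN | 463 (449-475) | 47 (36-62) |
| B. Performance at 85% Sensitivity | False Positive | True Negative |
| Radiologists | 7282 (7126-7441) | 54,427 (54,268-54,583) |
| ANN | 3341 (3232-3454) | 58,368 (58,256-58,477) |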
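The headline differences quoted in the text follow directly from the point estimates in Table 5, as the short check below illustrates. It is only a worked calculation: the variable names are hypothetical, the intervals are omitted, and the part B counts are treated as false positives and true negatives because each row sums to the 61,709 benign findings.

```python
# Point estimates copied from Table 5 (intervals omitted).
TOTAL_MALIGNANT = 510
TOTAL_BENIGN = 61709

# Part A: true positives at 90% specificity.
true_positives = {"radiologists": 419, "ann": 463}
# Part B: false positives at 85% sensitivity.
false_positives = {"radiologists": 7282, "ann": 3341}

sensitivity = {who: tp / TOTAL_MALIGNANT for who, tp in true_positives.items()}
specificity = {who: 1 - fp / TOTAL_BENIGN for who, fp in false_positives.items()}

extra_cancers_found = true_positives["ann"] - true_positives["radiologists"]
false_positives_avoided = false_positives["radiologists"] - false_positives["ann"]

print(f"additional cancers identified by the ANN: {extra_cancers_found}")
print(f"false positives avoided by the ANN: {false_positives_avoided}")
for who in ("radiologists", "ann"):
    print(f"{who}: sensitivity {sensitivity[who]:.3f}, specificity {specificity[who]:.3f}")
```

The differences come out to 44 additional cancers and 3941 fewer false positives. The recomputed sensitivities and specificities agree with the aggregate values in Table 4 to within about 0.1 percentage point, which is consistent with the table counts being rounded estimates.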
The H-L statistic for our ANN was 12.46 (P = .13, df = 8). The precision of the predicted probabilities is shown graphically in Figure 2. Although the calibration curve of our ANN does not perfectly match the line of identity (the line at a 45° angle), the deviation is visually small.
We have demonstrated that our ANN can accurately estimate the risk of breast cancer by using a dataset that contains demographic data and prospectively collected mammographic findings. To our knowledge, this study uses 1 of the largest datasets of mammography findings to develop a CADx model. Our results demonstrate that ANNs may have the potential to aid radiologists in discriminating between benign and malignant breast diseases. When we compare discriminative accuracy by using AUC, sensitivity, and specificity, our ANN performs significantly better than all radiologists in aggregate. Although the difference between the AUCs of the radiologists and our ANN may appear to be small (0.026), this difference is both statistically (P < .001) and clinically significant because our ANN identified 44 more cancers and decreased the number of false positives by 3941 when compared with the radiologists at the specified sensitivity and specificity values. Note that these results would be similar for any other specified sensitivity and specificity values because the ROC curve of our ANN outperforms that of the radiologists at all threshold levels. On the other hand, the reason for obtaining a numerically small difference between the AUCs relates to the disproportionate number of benign findings (61,709) compared to malignant findings (510) in our dataset resulting in very high specificity at baseline and little room for improvement in this parameter.
Among statistically significant comparisons, our ANN demonstrates superior AUC, sensitivity, and specificity versus all but 1 radiologist, including the 2 highest-volume readers. Therefore, similar to other ANN models presented in the literature, our ANN has the potential to aid radiologists in classifying (discriminating) findings on mammograms by predicting the risk of malignancy. When compared with the previous CADx models developed by our research team (a logistic regression and a Bayesian network), the discrimination performance of our ANN was slightly higher (ANN AUC, 0.965; logistic regression AUC, 0.963; Bayesian network AUC, 0.960). On the other hand, no statistically significant difference was found between the ANN and the logistic regression (P = .57), or the ANN and the Bayesian network (P = .13).
However, our model is unique in several ways. In contrast to prior ANN models, which used a relatively small selected population of suspicious findings undergoing tissue sampling with biopsy as the reference standard,30, 31, 35-37, 39 we use a large consecutive dataset of mammography findings with tumor registry outcomes as the reference standard to train our ANN. Furthermore, contrary to previously developed CADx models in breast cancer-risk prediction, we expand the evaluation of CADx models beyond discrimination by measuring the accuracy of the estimated probabilities themselves by using calibration metrics.
Although discrimination or accurate classification is of primary interest for CADx models,11, 60 calibration is also crucial, especially when clinical decisions are being made for individual patients.11, 61 Individual decisions are made under uncertainty and, therefore, are aided more effectively by accurate risk estimates. Because there is a trade-off between discrimination and calibration,10 the selection of the primary performance measure should be based on the intended purpose of the model.11 In this study, similar to previous CADx models, we designed our ANN primarily to optimize discrimination. However, contrary to previous CADx studies, we also measured calibration as a secondary objective. We showed that our ANN is well calibrated, as demonstrated by the low value of the H-L statistic, the corresponding high P value, and the favorable calibration curve; thus, our ANN can accurately estimate the risk of malignancy for individual patients. The ability of our ANN to assign accurate numeric probabilities is an important complement to its ability to discriminate between ultimate outcomes.61
We posit that the good calibration of our ANN is attributable to both the characteristics of our training set and attributes of our model. For example, the consecutive nature of our dataset of mammography findings and the use of a tumor registry match as a reference standard, which reflects a real-world population, may lead to accurate calibration. In addition, the use of a large number of hidden nodes in concert with training with a validation set to prevent overfitting may have enhanced calibration. In future work, we plan to analyze which parameters most profoundly influence calibration.
CADx models for breast cancer risk estimation have ignored calibration and have typically been developed and evaluated on the basis of their discrimination ability.31-39 Although calibration has not been formally assessed in previous CADx models, there is some evidence that these models are not well calibrated.31, 35, 42 Poor calibration may indicate that these models are not optimized for individual cases, ie, the predicted breast cancer risk for a single patient may be incorrect.
From a clinical standpoint, our ANN may be valuable because it provides an accurate post-test probability for malignancy. This post-test probability may be useful for communication among the radiologist, patient, and referring physician, which, in turn, may encourage shared decision making.5-7 Each individual patient has a unique risk tolerance and comorbidities, and these factors should be considered when making decisions involving mammographic abnormalities. Risk assessments based on individual characteristics may also help promote the concept of personalized care in the diagnosis of breast cancer. Furthermore, our ANN is designed to increase the effectiveness of mammography by aiding radiologists and not by acting as a substitute. Our ANN quantifies the risk of breast cancer by using mammographic features assessed by the radiologist, so the ANN's performance depends largely on the radiologist's accurate observations and overall assessment (BI-RADS category).
Our ANN has the potential to be used as a decision-support tool, although it may face challenges similar to those that have, in the past, prevented the implementation of effective decision-support algorithms in clinical practice. To be used in the clinic, a decision-support tool must be seamlessly integrated into the clinical workflow, which can be challenging. We believe that, in the case of mammography, a decision-support tool would be most useful if directly linked to the structured reporting software that radiologists use in daily practice, which would enable immediate feedback. On the other hand, the good performance of our ANN may not be preserved after integration into clinical practice. Before clinical integration, it is important to consider the ways our ANN could fail, due to both inherent theoretical limitations and errors that may occur during the process of integration.62 In fact, numerous computer-aided diagnostic models that have performed well in evaluation studies have not made an impact on clinical practice.63-68 Furthermore, optimal performance of our ANN would be required to gain the trust of clinicians and influence clinical practice. Unfortunately, the parameters of ANNs do not carry any real-life interpretation, and clinicians have trouble trusting decision-support algorithms that represent a “black box” without explanation capabilities. Although there is rule extraction software that converts a trained ANN to a more humanly understandable representation,69-71 integration of these various software programs with the ANN requires extra effort. Therefore, we recognize that substantial challenges remain in the implementation of ANNs for decision support at the point of care, and we emphasize the importance of these issues for future research and implementation.
There are 3 important implementation considerations. First, determining the number of effective hidden nodes in an ANN is crucial and may significantly affect its output performance. Unfortunately, there is no general rule to determine the effective number of hidden nodes that maximizes the network performance when presented with an unseen dataset (generalizability).47 Although conventional wisdom suggests that neural networks with excess hidden nodes generalize poorly,48 several recent studies in the machine-learning literature have shown that ANNs with excess capacity (ie, with a large number of hidden nodes) generalize better than small networks (ie, networks with a small number of hidden nodes) when trained with backpropagation and early stopping.47-49 Therefore, we built an ANN with excess capacity and did not optimize the number of hidden nodes. Also, note that if we had optimized the number of hidden nodes to maximize the AUC, as other researchers have, we would have achieved an even higher AUC than described here.
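Early stopping, mentioned above, halts training once performance on a held-out validation set stops improving, and keeps the weights from the best epoch. The sketch below shows one common patience-based rule applied to a list of validation losses; the function name, the patience parameter, and the loss values are hypothetical, and the stopping criterion actually used for our ANN is not specified here.

```python
def early_stopping(validation_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a patience-based stopping rule.

    Training stops at the first epoch that is `patience` epochs past the
    best validation loss seen so far; the model from best_epoch is kept.
    """
    if not validation_losses:
        raise ValueError("need at least one validation loss")
    if patience < 1:
        raise ValueError("patience must be at least 1")
    best_epoch, best_loss = 0, validation_losses[0]
    for epoch in range(1, len(validation_losses)):
        loss = validation_losses[epoch]
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    # Validation loss never stalled for `patience` epochs.
    return best_epoch, len(validation_losses) - 1


# Hypothetical validation losses: the minimum is at epoch 2, training stops at epoch 4.
best, stopped = early_stopping([1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9], patience=2)
```

Because the stopping point is driven by the validation set rather than by network size, a network with excess hidden nodes can be trained without fitting the training data to completion.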
Second, selection of the primary performance measure is also crucial when building an ANN model. In our study, we built our ANN principally to maximize the discrimination accuracy because discrimination is of primary interest to optimize accurate diagnosis.11, 60 On the other hand, ANNs could also be trained for maximizing the calibration when the primary purpose is to stratify individuals into higher or lower risk categories of clinical importance. However, it should be noted that for a direct maximization of calibration, the estimated probabilities by the ANN should be compared with the true underlying probabilities,72