An Intelligent Diagnosis of Liver Diseases using Different Decision Tree Models

Background: Liver cancer is the third most common cause of cancer mortality. Artificial intelligence, as a diagnostic tool, can reduce physicians’ workload. However, because liver diseases have many causes and contributing factors, they are not easily diagnosed. This study applies and compares several decision tree models for the diagnosis of liver disease. Methods: The records of 583 patients from the North East of Andhra Pradesh, India, registered at the University of California in 2012, were collected. The decision tree models were compared on three measures: sensitivity, accuracy, and area under the ROC curve. Results: Decision-Stump showed better results than the other models. The accuracy, sensitivity, and area under the ROC curve of Decision-Stump were 71.3058%, 1, and 0.646, respectively. Conclusion: Decision-Stump was the best-performing model among those compared and is therefore recommended for liver disease diagnosis. These findings may be useful for allocating health resources to people at risk.


Introduction
Liver cancer is the third most common cause of cancer mortality (1). Approximately 560 000 new cases of liver cancer are diagnosed annually worldwide (2). Failure to diagnose liver disease at an early stage is one of the main problems associated with it. The liver may function improperly even when the injury to it is small (3). Early diagnosis is therefore very important and increases the survival rate of patients.
Nowadays, different classification algorithms and models are used to predict and diagnose various diseases. These techniques can enhance diagnostic accuracy and help practitioners in early diagnosis.
Artificial intelligence has become effective in medical applications over the last few decades. The term refers to systems that can imitate intelligent human behavior, including understanding complex situations, simulating human thinking and reasoning, responding successfully, learning, and acquiring knowledge and reasoning to solve problems (4)(5)(6). In medicine, artificial intelligence is mainly applied to process and analyze medical information and to communicate it to the relevant users, drawing on the knowledge and experience gained from operating various systems in medicine and treatment. One of its most important uses in medicine is recognizing and diagnosing diseases (7,8). This application includes diagnostic models built with various decision-making methods and intelligent systems. These models are based on knowledge and experience of the given system, which is provided to the computer. A new case is then compared and evaluated against that information or model, and the difference from the normal model, or the type of deviation, is reported. Examples include identifying different patterns in medical images (9,10), automatic diagnosis of diseases from signals (11)(12)(13)(14), computer-based classification and recognition of various blood cells, and the analysis of mortality rates (15,16).
Creating a comprehensive intelligent system minimizes the cost of information processing and storage and provides quick access to disease records.
Liver diseases can be diagnosed from different signs and symptoms and from the analysis of enzyme levels (17). Because various factors affect the diagnosis of liver diseases, this process can be error-prone and complicated. Classification algorithms can help practitioners identify and predict liver diseases. The data used in this research were collected from the records of 583 patients in the North East of Andhra Pradesh, India, and were registered at the University of California in 2012 (18). The data included the following variables: age, gender, direct bilirubin, total bilirubin, total proteins, albumin/globulin (A/G) ratio, albumin, SGPT (serum glutamic-pyruvic transaminase), SGOT (serum glutamic-oxaloacetic transaminase), and alkaline phosphatase.

Materials and Methods
Our research used decision tree models to diagnose liver disease. Decision trees are decision-support tools that use a tree-like graph (a model of decisions). Classification using a decision tree is a method commonly used in data mining. The aim is to create a model that predicts the target variable from the input variables, also called input features. In this approach, the learned model is a tree in which each internal node corresponds to an input feature and each edge leaving that node corresponds to values of that feature. Each leaf gives a value of the target variable for the input feature values found along the path from the root to that leaf. The tree is learned by splitting the data into subsets according to a test on a feature value. This procedure is repeated on each resulting subset, a process called recursive partitioning. The recursion terminates when all records in a subset have the same value of the target variable. The top-down induction of decision trees (19), a greedy algorithm, is known as a well-rounded strategy for learning decision trees. Recent evidence indicates that some methods based on a bottom-up process can also perform this task (20). Our study used various tree models to diagnose and predict liver disease, and these models were evaluated and compared.
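The recursive partitioning described above can be illustrated with a minimal sketch in Python. This is not the implementation used in the study (the models were run in Weka); the function names, the use of Gini impurity to score splits, and the toy data are assumptions made only for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Greedy search for the (feature, threshold) test with the lowest weighted impurity."""
    best = None  # (score, feature index, threshold)
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # a test that separates nothing is not a split
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(rows, labels):
    """Top-down recursive partitioning; stops when a subset is pure or cannot be split."""
    if len(set(labels)) == 1:
        return labels[0]  # leaf: every record has the same target value
    split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left]),
        "right": build_tree([r for r, _ in right], [y for _, y in right]),
    }

def predict(tree, row):
    """Follow the path from the root to a leaf."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree

# Toy, made-up records: (age, total bilirubin) -> 1 = liver disease, 0 = no liver disease
X = [(30, 0.7), (45, 0.9), (50, 3.1), (62, 4.5), (38, 2.8), (55, 0.8), (70, 0.8)]
y = [0, 0, 1, 1, 1, 0, 1]
tree = build_tree(X, y)
```

On this toy data the root tests the second feature, and the left branch is split once more on the first feature, which shows the same procedure being reapplied to a subset.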

Data collection
We used the records of 583 patients from the North East of Andhra Pradesh, India, registered at the University of California in 2012 (18). The data comprised 416 patients with liver disease and 167 patients without liver disease. Accordingly, the target variable had two classes: liver disease and non-liver disease. The population included 441 males and 142 females. The database contained ten liver diagnosis variables.
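The counts reported above can be checked for consistency, and they also give a simple reference point for reading accuracy figures: the accuracy of a rule that always predicts the larger class. The short sketch below only re-derives numbers from the counts in this section; the variable names are illustrative.

```python
# Counts as reported in the data collection section
n_disease, n_no_disease = 416, 167
n_male, n_female = 441, 142

n_total = n_disease + n_no_disease

# Both breakdowns should describe the same set of records
counts_agree = (n_male + n_female == n_total)

# Accuracy of always predicting the majority class (about 0.714 here)
majority_baseline = max(n_disease, n_no_disease) / n_total
```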

Statistical analysis
In the proposed method, k-fold cross-validation was used to evaluate the models. The statistical program used in this study was Weka 3.8, free software licensed under the GNU General Public License.
In this method, the data were partitioned into k subsets. K-fold cross-validation has the advantage of tending to be less biased than other methods (21). In each of the k iterations, one subset was used for validation and the remaining k-1 subsets were used for training. The procedure was repeated k times, so that every record was used k-1 times for training and exactly once for testing. Finally, the mean of the k validation results was taken as the final estimate. In this study, the commonly used 10-fold cross-validation was applied (22).
The choice between a single training/test (or training/validation/test) split and k-fold cross-validation depends on the amount of data available and on how well the data represent the distribution to which the model will be applied. Ideally, an independent test set would be used to verify the model's performance. When the dataset is not large enough to be split into training and test sets with those characteristics, cross-validation is used so that as much data as possible contributes to both training and testing.
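The k-fold procedure can be sketched in a few lines of Python. This is an illustrative sketch, not the Weka routine used in the study: the function names and the `evaluate` callback are hypothetical, and the folds here are formed by simple shuffling without stratifying by class, which an evaluation tool may do differently.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(n, k, evaluate, seed=0):
    """Average evaluate(train_idx, test_idx) over the k folds.

    `evaluate` is a hypothetical callback that trains a model on the
    training indices and returns a score on the test indices.
    """
    folds = k_fold_indices(n, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th one is used for training
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k

# 583 records and 10 folds, as in this study
folds = k_fold_indices(583, 10)
fold_sizes = sorted(len(f) for f in folds)
```

With 583 records and k = 10, three folds hold 59 records and seven hold 58, and each record appears in exactly one test fold.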

Results
Weka software (23) was used in this study. The data set included 583 patient records, of which 416 were from patients with liver disease and 167 from patients without liver disease. Thus, the target variable was divided into two groups: liver disease and non-liver disease. The risk factors used in this study were age, gender, direct bilirubin, total bilirubin, total proteins, A/G ratio, albumin, SGPT, SGOT, and alkaline phosphatase.

The performance of the proposed model
In this study, rotation forest, AD-Tree, BF-Tree, Decision Stump, J48, NB-Tree, random forest, and random tree were used to diagnose liver disease. These eight decision tree models were implemented, and the results of the comparison are summarized in Table 1.
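The comparison in Table 1 rests on three measures: accuracy, sensitivity, and the area under the ROC curve (AUC). The following minimal sketch shows one way such measures can be computed from true labels, predicted labels, and classifier scores. The function names and the toy data are assumptions for illustration, and the AUC is computed with the pairwise-ranking formulation rather than by any particular tool's routine.

```python
def accuracy(y_true, y_pred):
    """Fraction of records whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def sensitivity(y_true, y_pred, positive=1):
    """True-positive rate: TP / (TP + FN)."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)

def auc(y_true, scores, positive=1):
    """Area under the ROC curve: the chance that a random positive record
    is scored above a random negative one (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy, made-up labels (1 = liver disease) and classifier scores
y_true = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(y_true, y_pred)      # 3 of 5 records are classified correctly
sens = sensitivity(y_true, y_pred)  # 2 of 3 positive records are detected
area = auc(y_true, scores)          # 5 of 6 positive/negative pairs are ranked correctly
```

A classifier that labels every record as positive reaches a sensitivity of 1 by construction, and constant scores give an AUC of 0.5, so the three measures are best read together.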
As seen in Table 1, the results were compared in terms of accuracy, sensitivity, and the area under the ROC curve. In each evaluation, the best value has been highlighted. Table 1 shows that the Decision-Stump model is the best model, with the most evaluation merits. For the area under the ROC curve, values of 0.5 or below indicate classification no better than random, and values from 0.5 to 1 indicate that the model has diagnostic ability.
In Rotation-Forest, the base classifier was J48. The maximum and minimum group sizes were both 3, and ten iterations were performed. Principal Components was the filter used to project the data, and the percentage of instances to be removed was set to 50%.
In AD-Tree, ten boosting iterations were performed; this number sets the complexity/accuracy tradeoff. More boosting iterations result in larger (potentially more accurate) trees but make learning slower. Each iteration adds three nodes (one split and two predictions) to the tree unless merging occurs. The type of search performed when building the tree was set to expand all paths, which is an exhaustive search. The other search methods are heuristic: they are not guaranteed to find an optimal solution, but they are much faster.
In BF-Tree, heuristic search was used for binary splits on nominal attributes. The minimal number of instances at the terminal nodes was two. Five folds were used in internal cross-validation. Post-pruning was selected as the pruning strategy, the error rate was used as the error estimate, and the Gini index was used as the splitting criterion.
Decision-Stump is usually used in conjunction with a boosting algorithm. It performs regression (based on mean-squared error) or classification (based on entropy). Missing values are treated as a separate value.
In J48, the confidence factor used for pruning was 0.25 (smaller values incur more pruning). The minimum number of instances per leaf was 2. The number of folds, which determines the amount of data used for reduced-error pruning, was three: one fold is used for pruning and the rest for growing the tree. The subtree raising operation was considered when pruning.
NB-Tree is a class for generating a decision tree with naive Bayes classifiers at the leaves. Random forest is a class for constructing a forest of random trees. The depth of the trees was unlimited in this model, and a total of ten trees were generated.
Random-Tree is a class that constructs a tree by considering K randomly chosen attributes at each node. It does not prune the tree. It also has an option to estimate class probabilities based on a hold-out set (backfitting). The number of randomly chosen attributes was log2(number-of-attributes) + 1. The minimum total weight of the instances in each leaf was one.
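Because Decision-Stump is central to the results, a minimal sketch of a one-level decision tree for classification may help. It selects the single attribute test with the lowest weighted entropy, in line with the entropy-based classification mentioned above. It is not Weka's implementation: the function names and toy data are assumptions, only numeric attributes are handled, and missing values are not treated as a separate value here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def majority(labels):
    """Most frequent class label."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels):
    """Find the single (attribute, threshold) test with the lowest weighted entropy."""
    best = None
    n = len(rows)
    for a in range(len(rows[0])):
        # The largest value is skipped because it would leave the right branch empty
        for t in sorted({r[a] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[a] <= t]
            right = [y for r, y in zip(rows, labels) if r[a] > t]
            score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
            if best is None or score < best["score"]:
                best = {"score": score, "attribute": a, "threshold": t,
                        "left": majority(left), "right": majority(right)}
    return best

def predict_stump(stump, row):
    """Apply the single test and return the class stored on that branch."""
    side = "left" if row[stump["attribute"]] <= stump["threshold"] else "right"
    return stump[side]

# Toy, made-up records: (total bilirubin, alkaline phosphatase) -> 1 = liver disease
rows = [(0.7, 187), (0.9, 490), (3.1, 210), (4.5, 195), (2.8, 699), (0.8, 202)]
labels = [0, 0, 1, 1, 1, 0]
stump = fit_stump(rows, labels)
```

A stump makes its decision from one attribute and one threshold, which is why it is often used as a weak learner inside a boosting algorithm.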

Discussion and Conclusion
The liver is a vital organ in the human body, and our survival depends on it. Diseases of this organ are among the world's top 10 killer diseases, and liver cancer is the third leading cause of cancer death worldwide. A major concern is that liver disease often goes undetected, for which different causes have been suggested. Early diagnosis of liver injury is a critical step in treatment. Therefore, the goal of this research was to suggest a model for the early diagnosis of this disease. The records of 583 patients from the North East of Andhra Pradesh, India, registered at the University of California in 2012, were collected. Eight decision tree models were applied and compared on three measures: sensitivity, accuracy, and area under the ROC curve. Decision-Stump showed better results than the other techniques and is therefore recommended for liver disease diagnosis. This paper is valuable for health research and is especially relevant to the allocation of health resources for people at risk.
In accordance with our study, Nahar and Ara (24), in their study of liver disease prediction using different decision tree techniques, reported that Decision-Stump provides higher accuracy than other techniques such as J48, REPTree, logistic model tree (LMT), random tree, random forest, and Hoeffding tree. Azam et al (25) predicted liver diseases using several machine learning-based approaches. They built computational models to predict liver disease accurately, using classification algorithms such as perceptron, random forest, K-nearest neighbors (KNN), decision tree, and support vector machine (SVM). They showed that, with feature selection, the KNN algorithm outperformed all other techniques. Daş (26), in a comparative study of the performance of classification algorithms such as neural network, high performance (HP) SVM, auto neural, HP forest, HP tree (decision tree), and HP neural for effective diagnosis of liver diseases, showed that HP Forest achieves the highest accuracy rate.
Table 1 footnote: The best value has been bolded. ROC, receiver operating characteristic.