Expert system for diagnosing diphtheria with k-nearest neighbor method

The Diphtheria outbreak is now a special concern for the Indonesian government and was recorded as an extraordinary event (KLB) in 2017 [1]. Diphtheria is an infectious disease that can cause dangerous and deadly complications if not treated immediately, yet people often underestimate familiar complaints such as sore throat, flu, and fever. The similarity of Diphtheria symptoms to those of common diseases, together with complications such as Myocarditis, Airway Obstruction, and Acute Kidney Injury (AKI), makes Diphtheria difficult to treat because the infection spreads so quickly. Because some complications of Diphtheria can cause death if not treated immediately, the disease must be identified early, and an expert system is needed that can help the community and the government in diagnosing Diphtheria.

In this study, the authors limit the data to 148 Diphtheria cases from previous studies and use the KNN method with a similarity threshold of 0.8 to calculate the similarity between a case and a specific diagnosis, that is, whether a Diphtheria case is classified as Myocarditis, Airway Obstruction, or Acute Kidney Injury (AKI). Testing uses cross validation with a random K value from 1 to 10 to determine the accuracy of the method applied in this study.
The first study examines how K-Nearest Neighbor can select documents efficiently based on their categories. It concluded that the model used to process and select similar documents was word-based, but that using words to represent document content still caused two problems: semantic confusion, and noise introduced when inappropriate training documents share words with the documents to be classified. Disease diagnosis is different, because the attributes clearly come from expert analysis and from medical record data [2]. The second study applied 60 medical records of intestinal diseases from RSUD dr. Soetrasno Rembang, with 40 source cases and 20 target cases, and reported a system diagnosis accuracy of 95%. However, the study did not use a larger amount of data, the attributes used are all symptoms and do not include patient characteristics, and the 60 records are classified into 5 underlying intestinal diseases without covering complications of the disease [3]. The third study addressed the limited number of mechanics and limited user knowledge, which cause difficulties in handling machine damage. An artificial intelligence system based on case-based reasoning can facilitate the initial diagnosis of a case, but the study states neither the amount of training data used nor the accuracy achieved [4].
The fourth study concerns the automatic diagnosis of non-critical diseases, or of cases in remote locations where doctors may not be available. After examining different data mining techniques, its authors found KNN suitable for medical diagnosis and achieved accuracy above 84% [5]. The study does not mention how much data was used in the experiment, and it gives no information about Diphtheria as a disease class. The fifth study discusses the diagnosis of heart disease, with accuracy between 94% and 97.4% for K values between 1 and 13 [6]. It does not state how much data was used and focuses only on heart disease. The sixth study develops a method for diagnosing heart disease that uses a meta-heuristic optimization algorithm to select the most relevant data from a heart disease dataset for classification [7]. It discusses K-means for classification and the integration of KNN with genetic algorithms, which improves classification accuracy on many datasets, but it does not report the accuracy achieved.

Data Mining
Data mining is a method for finding knowledge that can be used to describe the data in a database. It involves a number of processes, including data sorting, data normalization, and retrieving information from large amounts of data [8]. Data mining is closely related to analysis: it interprets data from large information sources to find meaningful patterns and rules. Some important aspects of data mining are the following [9].
1. Automatic discovery of patterns and rules in data.
2. Automatic processing of large amounts of data.
3. Automatic extraction of meaningful patterns that serve as indications and useful information.
The data mining process is illustrated in Figure 1 [10]. It starts from data sources in the form of relational and document databases, as well as flat files. These data, which were previously irregular or widely modified, are extracted, cleaned, and loaded into a data warehouse. In the last step, the data in the warehouse are processed by data mining into patterns and rules that provide useful information and knowledge. The figure shows how very large data sources, which previously offered little benefit because they were used only temporarily, can become an easily accessible source of knowledge through data mining [10].

Expert System
Expert systems are a part of artificial intelligence and have been developed for a long time. They are closely related to knowledge summarized in large data; the source of this knowledge is an expert's experience in solving problems, which yields data patterns that deliver information to users through a computerized system. Expert systems capture human knowledge, create patterns or rules in large datasets, and use the data entered into the computer to solve cases or problems that are usually resolved by experts [11]. In general, expert systems automate knowledge that is stored in a database, and the data then provide the information the user needs to solve the problems that arise. The working concept of an expert system is shown in Figure 2 [12].

Fig. 2. Expert system process
As shown in Figure 2, an expert system consists of a consultation environment and a development environment. Human knowledge is compiled in a database, and a user interface connects the user with the inference engine, which works with the knowledge base and working memory to solve problems that are usually resolved by experts [13].

K-Nearest Neighbor (KNN)
The KNN method is an algorithm for classifying data or objects based on their proximity to one another [14]. KNN classifies new data whose class is not yet known by comparing it with the data already owned: it selects a number of closest neighbors and finds their similarity values with the new data. The k objects with the closest similarity values are selected, and the new data object is classified accordingly. The k value is chosen as an odd number so that the classification does not end in a tie [15]. Similarity is a measure of how closely new data resemble existing data. The calculation of the similarity between a new object and an old object is described in equation (1).
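To make the similarity calculation concrete, the following is a minimal sketch that assumes the weighted-average form often used in case-based reasoning: each attribute contributes a local similarity between 0 and 1, multiplied by its weight, and the sum is divided by the total weight. This form, the function name weighted_similarity, and the example values are assumptions for illustration and are not taken from equation (1).

```python
def weighted_similarity(local_sims, weights):
    """Weighted average of per-attribute similarities, each in [0, 1].

    Assumed form: sum(s_i * w_i) / sum(w_i). Not taken from the paper.
    """
    if len(local_sims) != len(weights):
        raise ValueError("local_sims and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one weight must be non-zero")
    return sum(s * w for s, w in zip(local_sims, weights)) / total_weight


# Hypothetical example: three attributes, the first weighted most heavily.
# The new case matches the old case on the first two attributes only.
print(weighted_similarity([1, 1, 0], [1.0, 0.5, 0.5]))  # 0.75
```

With this form, a mismatch on a heavily weighted attribute lowers the similarity more than a mismatch on a lightly weighted one.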

Method
This study proceeds in several stages, from a literature study to experiments and testing of the KNN method, which compares the symptoms of a new case with stored Diphtheria data. The study applies the working concept of an expert system together with data mining on the knowledge base to obtain an early diagnosis of Diphtheria quickly, before the disease develops fatal complications. Data on Diphtheria can be collected from various valid sources, and data on the disease and its characteristics from previous research can be used for the trials in this study. Expert interviews are needed to obtain information and to validate the collected data; specialist doctors are the first reference for information and knowledge about Diphtheria. After all data have been collected, the analysis uses the KNN method to find the similarity value against cases that have already occurred: when symptoms are entered for a new case, the system checks whether the similarity value is high (≥ 0.8) and classifies the case according to the Diphtheria class. The flow chart in Figure 5 describes the KNN process from symptom input until the diagnosis is stored in memory. The process starts with symptom input, which goes directly into the calculation of the similarity value using the KNN method. If the similarity value is ≥ 0.8, the system returns the diagnosis of the matching case, and the case can be reused to enlarge the knowledge base. If the similarity value is < 0.8, the system looks for the right diagnosis, and the data become a new case that can be stored in the database.
The knowledge base grows and improves as more new cases enter it and as diagnoses are concluded based on similarity values [19].
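The flow described above (retrieve the most similar stored case, reuse its diagnosis when the similarity reaches 0.8, otherwise keep the input as a new case) can be sketched as follows. This is an illustrative sketch only: the Case class, the match_fraction placeholder similarity, and the symptom names are hypothetical and do not come from the study, which uses weighted attributes.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Case:
    symptoms: Dict[str, int]   # hypothetical attribute name -> 1 (present) or 0 (absent)
    diagnosis: Optional[str]   # None until an expert supplies a diagnosis


def match_fraction(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Placeholder similarity: share of attributes with equal values."""
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    return sum(1 for k in keys if a.get(k) == b.get(k)) / len(keys)


def diagnose(new_symptoms: Dict[str, int], case_base: List[Case],
             threshold: float = 0.8) -> Tuple[Optional[str], float]:
    """Reuse the best matching diagnosis, or retain the input as a new case."""
    best_case, best_sim = None, 0.0
    for case in case_base:
        if case.diagnosis is None:      # skip cases still awaiting a diagnosis
            continue
        sim = match_fraction(new_symptoms, case.symptoms)
        if sim > best_sim:
            best_case, best_sim = case, sim
    if best_case is not None and best_sim >= threshold:
        return best_case.diagnosis, best_sim
    # Below the threshold: store the input as a new, undiagnosed case.
    case_base.append(Case(dict(new_symptoms), None))
    return None, best_sim


base = [Case({"fever": 1, "sore_throat": 1, "bull_neck": 1, "stridor": 0}, "myocarditis")]
print(diagnose({"fever": 1, "sore_throat": 1, "bull_neck": 1, "stridor": 0}, base))
```

Cases stored without a diagnosis are skipped during retrieval in this sketch, so they only influence later results once an expert has labelled them.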

A. Data Collection
The data in this study are taken from previous studies and originate from medical records at RSUD Dr. Soetomo Surabaya. They cover the disease, its symptoms, and the doctors' diagnoses of patients suffering from Diphtheria. The collected data are presented in Table 1 and Table 2 [20]. Among the 148 patients diagnosed with Diphtheria, 79 (53.4%) were male and 69 (46.6%) were female. There were 91 (61.5%) patients aged less than 5 years, 47 (31.8%) aged 5 to 10 years, and 10 (6.7%) aged more than 10 years, with an average age of 5.11 years. Most patients, 77 (52%), came to the hospital 2 to 5 days after the onset of symptoms. Further data on the clinical diagnosis of the Diphtheria cases are presented in Table 2. ... patients, with a percentage of 22.9%. The most frequent complication was myocarditis, found in 17 of the 148 patients (11.5%). Six patients (4%) received tracheostomy therapy. Eleven patients (7.4%) died, and myocarditis was the cause of death in 9 of them (81.8%).

B. Expert Interview
Interviews with experts are mandatory in building an expert system; in this case, specialist doctors are the main source for interviews regarding Diphtheria. Doctors at RSUD Dr. Soetomo Surabaya, the resource persons chosen in the previous research, discussed the diagnostic data obtained and the expert system workflow in order to resolve the problems raised in this study. The interviews also covered the analysis of the attribute data and of the attribute weight values for Diphtheria that are processed by the KNN method, which are presented in Table 3. The weight values, ranging from 0 to 1, are derived from the disease diagnosis data and the clinical symptoms of patients with Diphtheria. They are based on analysis of the training data of 148 patients and on input from doctors, and the resulting data are used as the dataset in this study.

C. Calculation method
The KNN calculation is used to obtain a diagnosis for new patients. The method compares the data of a new patient with existing diagnosed cases, using the attribute data that have been defined and the weight of each attribute. Experiments were carried out on a new patient with symptoms indicating Diphtheria, shown in Table 4. The data in Table 5 are old data consisting of patient data, attributes, weights, and the similarity value of each symptom used in the calculation.
The second experiment compares the new patient with a second old case; the similarity values are described in Table 6. The data in Table 6 are old data consisting of patient data, attributes, weights, and the similarity value of each symptom used in the calculation.
The diagnosis of a previous case is used to calculate the similarity value for each symptom. If the similarity value lies between 0.8 and 1, the diagnosis of the old case is reused for the new case. The new patient data are tested sequentially against all old patient data until a similarity value has been obtained for each old case; the highest of these values, computed according to equation (1), determines the result. In the experiments above, the new patient data obtained similarity values of 0.875 and 0.6 against the two old cases, so the new patient is most similar to the old patient in the first trial, whose diagnosis is the complication myocarditis. In this study, new data continue to be stored in the database, so that the training data improve over time and can serve as an initial diagnostic reference for the characteristics of Diphtheria.
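The selection step can be illustrated with the two similarity values reported above. In this short sketch, the identifiers old_case_1 and old_case_2 and the function best_match are hypothetical labels introduced for the example; only the values 0.875 and 0.6 and the 0.8 threshold come from the text.

```python
def best_match(similarities, threshold=0.8):
    """Return (case id, similarity, reuse flag) for the most similar old case."""
    case_id, sim = max(similarities.items(), key=lambda item: item[1])
    return case_id, sim, sim >= threshold


# Similarity of the new patient to each old case, as reported in the text.
scores = {"old_case_1": 0.875, "old_case_2": 0.6}
print(best_match(scores))  # ('old_case_1', 0.875, True)
```

Because 0.875 is both the highest value and at least 0.8, the diagnosis of the first old case (myocarditis) is reused for the new patient.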

D. Testing
Testing was carried out on the 148 Diphtheria patient records using the KNN method, with 30 test records and cross validation; detailed manual calculation gave an accuracy of 94%. The training data consist of 148 medical records. The training portion is 80% of the total data, namely the 148 training records, and the portion used for the experiment is 20%, namely 30 records. The trial was repeated 10 times with different data drawn from the 148 records. The test results are presented in Figure 6. Testing 30 new case records against the 148 training records gave an accuracy of 93.056%.
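One possible reading of this protocol is a repeated random hold-out: in each of 10 trials, 30 records are set aside for testing and the rest are used as training data. The sketch below follows that reading under stated assumptions: the test records are drawn from the same pool as the training records, classification uses a single nearest neighbor with an unweighted attribute overlap, and the function names are hypothetical. It does not reproduce the study's weighted similarity or its reported figures.

```python
import random


def nearest_neighbor_label(train, query):
    """Label of the training record sharing the most attribute values with query."""
    def overlap(record):
        features, _label = record
        return sum(1 for a, b in zip(features, query) if a == b)
    _features, label = max(train, key=overlap)
    return label


def repeated_holdout_accuracy(records, n_trials=10, test_size=30, seed=0):
    """Mean accuracy over n_trials random splits of (features, label) records."""
    if not 0 < test_size < len(records):
        raise ValueError("test_size must be between 1 and len(records) - 1")
    rng = random.Random(seed)          # fixed seed so the splits are repeatable
    accuracies = []
    for _ in range(n_trials):
        shuffled = records[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:test_size], shuffled[test_size:]
        correct = sum(
            1 for features, label in test
            if nearest_neighbor_label(train, features) == label
        )
        accuracies.append(correct / test_size)
    return sum(accuracies) / n_trials


# Hypothetical toy data: two classes with clearly different attribute patterns.
toy = [((1, 1, 0), "A")] * 20 + [((0, 0, 1), "B")] * 20
print(repeated_holdout_accuracy(toy, n_trials=3, test_size=10))
```

Averaging over several splits reduces the dependence of the reported accuracy on any single choice of test records.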

Conclusion
This study concludes that the KNN method can be used as an alternative for calculating the distance and similarity between cases, as in the diagnosis of Diphtheria. KNN provides a fairly accurate diagnosis, with 93.056% accuracy from 148 training data and 10 trials with 30 test data. Future work on the expert system for diagnosing Diphtheria with the K-Nearest Neighbor method can modify the method to obtain better accuracy and add a mechanism for storing each new case with a lower similarity value in the knowledge base.