COVID-19 Patient Health Prediction using Artificial Intelligence Boosted Random Forest Algorithm

There is currently high demand for artificial intelligence (AI) techniques that can be integrated with real-time data collection, wireless infrastructure, and processing on end-user devices. The use of AI for the detection and prediction of very large pandemics is now noteworthy. The coronavirus pandemic of 2019 (COVID-19) began in Wuhan, China and has caused 175,694 deaths around the world, while the number of active patients stands at 2,544,792. In Pakistan, from January 2020 to March 2021, there were 658,132 positive cases, 603,512 recovered cases, and 16,208 deaths, as reported by the World Health Organization. The rapid, exponential increase in COVID-19 patients has made it necessary to predict possible patient outcomes quickly and efficiently using AI techniques, so that suitable treatment can be given. This paper proposes a fine-tuned random forest model boosted by the AdaBoost algorithm. The COVID-19 patient's health, geographical area, gender, and marital status are used to predict the severity of the case and the possible outcome, either recovery or no recovery (i.e. death). The model is 90% accurate and has an F1 score of 0.76 on the data set used. Analysis of the data shows a positive correlation between patient gender and death. It also shows that most of the patients were between twenty and seventy years of age.


INTRODUCTION
The healthcare industry is vast and requires medical data to be gathered and processed in real time. It therefore faces a data-handling problem: predictions must be made and information disseminated in real time so that practitioners can provide timely medical attention. The main stakeholders of the industry, including physicians, hospitals, vendors, and health-based companies, have attempted to collect, manage, and revive data so that it can be used to enhance medical practice and drive technological innovation.
Dealing with healthcare data has become difficult because of its massive volume, security issues, the inadequacy of wireless network applications, and the velocity at which it grows. Healthcare industries therefore require data analytics tools to manage complex data with greater efficiency, accuracy, and better workflow.
The coronavirus pandemic, which began in the Wuhan region of China, has caused an outbreak of respiratory illness around the globe. Studies have shown that the clinical characteristics of COVID-19 are similar to those of severe acute respiratory syndrome coronavirus (SARS-CoV). The most common symptoms of COVID-19 are cough and fever, while gastrointestinal symptoms are less common. Absence of fever is more frequent in patients infected with COVID-19 (2020) than in those infected with MERS coronavirus (2%) or SARS coronavirus (1%); a surveillance system that focuses primarily on the detection of fever may therefore miss non-febrile patients (Zumla A et al. 2015). Early COVID-19 patients were largely associated with the animal and seafood market in Wuhan, which indicated animal-to-person spread. However, many patients had no association with the animal markets, which demonstrates human-to-human transmission. The coronavirus pandemic is considered a global health emergency and is spreading at a rapid pace (Pham QV et al. 2020).
Figure 1 covers Punjab, Sindh, Khyber-Pakhtunkhwa, and Baluchistan for the period January 2020 to March 2021. Among the provinces, Punjab and Sindh have more COVID-19 positive cases and deaths, while Baluchistan and Khyber-Pakhtunkhwa have fewer. Moreover, Figure 1c indicates that recovered cases in Sindh province are growing at a much faster pace than predicted. As medical facilities are under massive stress, it is important that healthcare facilities and governments focus on identifying and treating the cases with a higher probability of survival, thereby making good use of the scarce stock of medications and medical resources.
The 21st century has seen the advent of a breakthrough technology, artificial intelligence (AI), which has applications in many fields, including weather prediction, autonomous systems, and astronomical exploration (Kathiresan S et al. 2020). Several related studies have applied AI to detection, prevention, and prediction in order to fight the pandemic. Wang and Wong (2020) used a convolutional neural network-based model to detect COVID-19 patients from chest X-ray (CXR) images. The model was pretrained on ImageNet and then trained on an open-source data set of CXR images. Pal et al. (2020), on the other hand, used an LSTM model to predict country-specific COVID-19 risk, which depends on the trends and weather data of the specific country, so that the likely spread of COVID-19 within that country can be predicted.
In Liu et al. (2020), ML was applied to process internet activity, health organization reports, news reports, and media activity in order to predict the spread of the coronavirus outbreak in China at the province level (Cai H, 2020). Bayes and Valdivieso (2020) is a further related study.
The aim of this paper is to bridge the gap left by traditional healthcare systems by using machine learning (ML) algorithms to process travel data and healthcare data simultaneously, together with other parameters of patients infected with COVID-19. The goal is to predict the probable outcome for a patient on the basis of symptoms, travel history, and delay in reporting a case, by identifying patterns in previous patient data. Our contributions are as follows:
i. Healthcare and travel data are processed with machine learning algorithms, rather than traditional healthcare systems, to identify people infected with COVID-19.
ii. Multiple algorithms for processing patient data are compared, and the boosted random forest is identified as the best of them. A grid search was then run to fine-tune the hyperparameters of the boosted random forest algorithm and improve its performance.
This work removes the need to re-compare existing algorithms for processing COVID-19 patient data. Researchers will be able to work towards a solution that combines the processing of patient demographics, health data, and travel data to better predict the health outcomes of COVID-19 patients.
The study is organized as follows. The Methodology section describes the materials and methods used, including the description of the data set, data preprocessing, and data analysis with the classification algorithms employed. The results section presents the outcomes of the experiment and is followed by the discussion section. The conclusion section summarizes the outcomes and outlines the scope for future work.
DOI: http://doi.org/10.48165/sajssh.2021.2313

Dataset
In view of the variables relevant to the context of Pakistan, primary data was collected from 50 respondents through telephone interviews. The variables included gender, marital status, patient health (whether the patient recovered or died), SOPs (whether or not they were followed), and whether or not symptoms such as malaise, cold, fatigue and body pain, fever, and cough were observed.

DATA ANALYSIS
Cough, fever, cold, body pain, fatigue, and malaise were the most common symptoms among the patients for whom data was available in our data set. The correlation of urban residence with death is lower than that of rural residence. In contrast, the correlation of urban residence with recovery is higher than that of rural residence. The correlation of age with death is lower than the correlation of age with recovery. Males have a higher correlation with death due to COVID-19 than females, while females and males have an equal correlation with recovery. Those who visited hospitals had a lower correlation with death than with recovery. Those who followed SOPs had a higher correlation with recovery than with death, and the same was true of those who did not follow SOPs. Those who suffered from malaise, fatigue/body pain, cold, cough, and fever had a greater correlation with recovery than with death. Both married and unmarried individuals had a greater correlation with recovery than with death.
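The kind of correlation reported above can be illustrated with a minimal sketch. It assumes that each variable is encoded as 0/1 and that the Pearson coefficient is used (for two binary variables this is the phi coefficient); the paper does not state which coefficient was computed. The records and the column names `male`, `followed_sops`, and `death` are hypothetical and are not the study data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("sequences must be non-empty and of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # convention: a constant column shows no linear association
    return cov / sqrt(var_x * var_y)


# Hypothetical toy records (NOT the study data): 1 = yes, 0 = no.
records = [
    {"male": 1, "followed_sops": 0, "death": 1},
    {"male": 1, "followed_sops": 1, "death": 0},
    {"male": 0, "followed_sops": 1, "death": 0},
    {"male": 0, "followed_sops": 1, "death": 0},
    {"male": 1, "followed_sops": 0, "death": 1},
    {"male": 0, "followed_sops": 0, "death": 0},
]

death = [r["death"] for r in records]
correlations = {
    feature: pearson([r[feature] for r in records], death)
    for feature in ("male", "followed_sops")
}
for feature, value in correlations.items():
    print(f"correlation of {feature} with death: {value:.3f}")
```

A positive value means the feature tends to co-occur with death in the toy records, and a negative value means it tends to co-occur with survival.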

Data Pre-processing
The data set consisted of columns of date, numeric, and string type, and it also included categorical variables. As the ML model needs numeric input, the categorical variables were label encoded: each unique categorical value within a column was assigned a number. Some patient records have missing values in the 'recov' and 'death' columns; these records were separated from the main data set and compiled into a test set, while the remaining records were compiled into a training set.
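The two preprocessing steps described above, label encoding and the split on missing outcome values, can be sketched as follows. This is a minimal illustration under assumptions: the records are hypothetical, `None` stands for a missing value, a record is sent to the test set if either outcome column is missing, and codes are assigned in sorted order of the category names. Only the column names 'recov' and 'death' come from the text.

```python
def label_encode(values):
    """Assign an integer code to each unique non-missing value (sorted order)."""
    categories = sorted({v for v in values if v is not None})
    mapping = {category: code for code, category in enumerate(categories)}
    codes = [mapping[v] if v is not None else None for v in values]
    return codes, mapping


# Hypothetical toy records (NOT the study data); None marks a missing value.
rows = [
    {"gender": "male", "marital_status": "married", "recov": 1, "death": 0},
    {"gender": "female", "marital_status": "single", "recov": None, "death": None},
    {"gender": "female", "marital_status": "married", "recov": 0, "death": 1},
]

# Replace each categorical string with its integer code, column by column.
mappings = {}
for column in ("gender", "marital_status"):
    codes, mappings[column] = label_encode([r[column] for r in rows])
    for row, code in zip(rows, codes):
        row[column] = code

# Records with an unknown outcome form the test set; the rest form the training set.
test_set = [r for r in rows if r["recov"] is None or r["death"] is None]
train_set = [r for r in rows if r["recov"] is not None and r["death"] is not None]

print(mappings)
print(len(train_set), "training records,", len(test_set), "test records")
```

Label encoding imposes an arbitrary order on the categories, which tree-based models tolerate because they split on thresholds rather than on magnitudes.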

Evaluation Metrics
This study aims to predict the outcome for a specific patient accurately on the basis of many different factors, including demographics and travel history. As this is an important prediction, accuracy is crucial. Four evaluation metrics were therefore considered for evaluating the model. The following terms are used in the equations: TP denotes true positives, TN true negatives, FP false positives, and FN false negatives.

i. Accuracy
Accuracy is the ratio of the classifier's correct predictions (TP + TN) to the total number of data points (TP + TN + FP + FN). Accuracy is vital in measuring the performance of the classification model. It is calculated as shown in the equation below:
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \]

ii. Precision
Precision is the ratio of True Positive (TP) samples to the True Positive (TP) and False Positive (FP) samples combined. Precision indicates how many of the patients assigned to a class were classified correctly, which is important when the data set is imbalanced.
Precision is calculated as shown in the second equation below:
\[ \text{Precision} = \frac{TP}{TP + FP} \tag{2} \]

iii. Recall Score
Recall is the ratio of True Positive (TP) samples to the True Positive (TP) and False Negative (FN) samples combined. Recall indicates how many patients were assigned to a class out of all patients that could have been predicted correctly for that class, which is likewise important when the data set is imbalanced. Recall is calculated as shown in the third equation:
\[ \text{Recall} = \frac{TP}{TP + FN} \tag{3} \]

iv. F1 Score
The F1 Score is the harmonic mean of precision and recall. It balances the two, giving a fair evaluation of the model's performance in classifying COVID-19 patients, and is an important measure for evaluating the model. The F1 Score is calculated as shown in the fourth equation:
\[ \text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4} \]
As the data set employed may be imbalanced, the F1 score is used as the primary metric for comparison.
Figures 3 to 6 show the performance of each of the models. Figure 7 shows the decision tree constructed to estimate the target variable. The decision tree has a depth of 2, and the Gini index of each node is below 0.5, indicating imbalance in the training data. As the best-performing model is the Boosted Random Forest algorithm, it is fine-tuned for improved performance on the data set.
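The four evaluation metrics defined in this section can be computed directly from the confusion-matrix counts. The sketch below is illustrative only: the counts are made up and are not results from the paper, and returning 0.0 when a denominator is zero is a convention assumed here.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 score from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts for an imbalanced problem: 16 positives out of 100 samples.
metrics = classification_metrics(tp=8, tn=80, fp=4, fn=8)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is high while the F1 score is much lower, which shows why the F1 score is the more informative primary metric when one class dominates.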

DISCUSSION: BOOSTED RANDOM FOREST CLASSIFICATION
The Boosted Random Forest is an algorithm consisting of two parts: the boosting algorithm, AdaBoost (27), and the Random Forest classifier, which is in turn made up of many decision trees. Decision trees build models whose structure resembles an actual tree.
The algorithm divides the data into smaller subsets while adding branches to the tree at the same time. The result is a tree consisting of decision nodes and leaf nodes. A decision node has two or more branches, each representing a value of the attribute that has been tested, for example age, symptom1, and so on. Leaf nodes hold the resulting value for the prospective condition of the patient, which is the target value. An ensemble of multiple decision tree classifiers removes the risk that a single decision tree fails to predict the target value correctly.
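The Gini index mentioned for the tree nodes can be computed as in the sketch below. It assumes the standard Gini impurity, 1 minus the sum of squared class proportions, and the usual size-weighted impurity for a candidate split; the paper does not spell out its exact formula. The function names and the toy feature are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a collection of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(rows, labels, feature_index, threshold):
    """Size-weighted Gini impurity of the two children produced by a threshold split."""
    left = [y for x, y in zip(rows, labels) if x[feature_index] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature_index] > threshold]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)


# A balanced binary node has the maximum impurity of 0.5.
print(gini(["recovered", "died", "recovered", "died"]))
# An imbalanced node (9 recovered, 1 died) has a much lower impurity.
print(gini(["recovered"] * 9 + ["died"]))

# Toy split on one hypothetical binary feature (e.g. 1 = fever observed).
toy_rows = [(0,), (0,), (1,), (1,)]
toy_labels = ["recovered", "recovered", "died", "died"]
print(split_gini(toy_rows, toy_labels, feature_index=0, threshold=0.5))
```

For two classes the impurity cannot exceed 0.5, so a node value well below 0.5 means that one class dominates the samples reaching that node.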
The random forest therefore averages the results obtained from multiple trees to provide the final result. Equation 5 expresses the margin function of the random forest, Equation 6 the generalization error, and Equation 7 the confidence in prediction. Here \(h_1(x), h_2(x), \ldots, h_K(x)\) is the ensemble of classifiers (the decision trees), and the training data are drawn from the random vectors X and Y.
The margin function may be expressed as:
\[ mg(X, Y) = \mathrm{av}_k\, I(h_k(X) = Y) - \max_{j \neq Y} \mathrm{av}_k\, I(h_k(X) = j) \tag{5} \]
Here, \(I(\cdot)\) is the indicator function and \(\mathrm{av}_k\) denotes the average over the K classifiers. The generalization error is:
\[ PE^{*} = P_{X,Y}\big(mg(X, Y) < 0\big) \tag{6} \]
where the probability is taken over the X, Y space. In random forests, \(h_k(X) = h(X, \Theta_k)\). As the number of classifiers (decision trees) increases, for almost all sequences of trees the probability \(PE^{*}\) converges to Equation (7), which follows from the tree structure and the Strong Law of Large Numbers.
\[ P_{X,Y}\Big(P_{\Theta}\big(h(X, \Theta) = Y\big) - \max_{j \neq Y} P_{\Theta}\big(h(X, \Theta) = j\big) < 0\Big) \tag{7} \]
The AdaBoost boosting algorithm (28) is applied to provide a corrective mechanism, so that the model improves after each prediction of the patient's state. The final decision is based on the summation over all base models. This is one of the most efficient techniques in ML.
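The margin function of a random forest can be made concrete with a short sketch. It assumes that each tree casts one vote for a class label; the margin for a sample is then the fraction of trees voting for the true label minus the largest fraction voting for any other label. The vote lists below are hypothetical.

```python
from collections import Counter


def margin(votes, true_label):
    """Vote share of the true label minus the largest vote share of any other label."""
    if not votes:
        raise ValueError("at least one vote is required")
    k = len(votes)
    counts = Counter(votes)
    correct_share = counts.get(true_label, 0) / k
    best_wrong_share = max(
        (count / k for label, count in counts.items() if label != true_label),
        default=0.0,
    )
    return correct_share - best_wrong_share


def empirical_error(all_votes, true_labels):
    """Fraction of samples with a negative margin, i.e. misclassified by the vote."""
    negatives = sum(1 for v, t in zip(all_votes, true_labels) if margin(v, t) < 0)
    return negatives / len(true_labels)


# Hypothetical votes from five trees for two patients.
votes_a = ["recovered", "recovered", "died", "recovered", "died"]
votes_b = ["died", "died", "recovered", "died", "died"]
print(margin(votes_a, "recovered"))  # positive: majority vote is correct
print(margin(votes_b, "recovered"))  # negative: majority vote is wrong
print(empirical_error([votes_a, votes_b], ["recovered", "recovered"]))
```

A larger positive margin indicates more confidence in the classification, and the share of samples with a negative margin is the empirical counterpart of the generalization error.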
The corrective mechanism may be given, in the standard AdaBoost form, as:
\[ D_{t+1}(i) = \frac{D_t(i)\, \exp\big(-\alpha_t\, y_i\, h_t(x_i)\big)}{Z_t} \]
where \(D_t(i)\) is the weight of sample i in round t, \(\alpha_t\) is the weight of base model \(h_t\), \(y_i \in \{-1, +1\}\) is the true label, and \(Z_t\) is the normalization factor. The final hypothesis is obtained as:
\[ H(x) = \operatorname{sign}\Big(\sum_{t=1}^{T} \alpha_t\, h_t(x)\Big) \]
The dependent variable in this case is the state of the patient (recovered/dead), whereas the explanatory variables are gender, age, marital status, whether or not SOPs were followed, and the presence or absence of each of the symptoms (1-6). The boosted random forest was used because it classifies accurately even on imbalanced data sets (25, 29). The decision trees shown in Figures 8, 9, 10 and 11 have a depth of 2.
Additionally, the Gini index of every leaf node of each tree reflects the fact that the depth of the trees has been decreased to 2 and the number of decision trees (estimators) has been increased to 50 within the random forest. This prevents high variance in the model and provides accurate predictions.
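The corrective mechanism of AdaBoost, reweighting the samples after every round and combining the base models by a weighted vote, can be sketched in plain Python. This is not the paper's implementation: one-feature decision stumps stand in for the random forests used as base models, the three binary features are hypothetical, and the labels are toy values with +1 for recovered and -1 for died.

```python
from itertools import product
from math import exp, log


def stump_predict(row, feature, polarity):
    """A one-feature weak learner: predicts `polarity` when the feature equals 1."""
    return polarity if row[feature] == 1 else -polarity


def train_stump(X, y, weights):
    """Return (weighted error, feature, polarity) of the best stump."""
    best = None
    for feature in range(len(X[0])):
        for polarity in (1, -1):
            error = sum(
                w for w, row, label in zip(weights, X, y)
                if stump_predict(row, feature, polarity) != label
            )
            if best is None or error < best[0]:
                best = (error, feature, polarity)
    return best


def adaboost(X, y, rounds):
    """Train `rounds` stumps, reweighting the samples after each one."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, feature, polarity = train_stump(X, y, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * log((1 - error) / error)
        ensemble.append((alpha, feature, polarity))
        # Misclassified samples gain weight, correctly classified ones lose weight.
        weights = [
            w * exp(-alpha * label * stump_predict(row, feature, polarity))
            for w, row, label in zip(weights, X, y)
        ]
        z = sum(weights)  # normalization factor
        weights = [w / z for w in weights]
    return ensemble


def predict(ensemble, row):
    """Final hypothesis: sign of the alpha-weighted sum of the stump predictions."""
    score = sum(a * stump_predict(row, f, p) for a, f, p in ensemble)
    return 1 if score >= 0 else -1


# Toy data: all combinations of three binary features; label is their majority.
X = list(product((0, 1), repeat=3))
y = [1 if sum(row) >= 2 else -1 for row in X]

ensemble = adaboost(X, y, rounds=3)
first_correct = sum(stump_predict(r, ensemble[0][1], ensemble[0][2]) == t for r, t in zip(X, y))
boosted_correct = sum(predict(ensemble, r) == t for r, t in zip(X, y))
print(f"first stump alone: {first_correct}/8, boosted ensemble: {boosted_correct}/8")
```

On this toy problem no single stump can separate the classes, but three reweighted stumps together can, which is the effect boosting is meant to achieve.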

CONCLUSION AND FUTURE WORK
Applying artificial intelligence to process patient data is very important for making treatment strategies effective. This work presented a model that implements the Random Forest algorithm boosted by the AdaBoost algorithm, with an F1 score of 0.76 on the COVID-19 patient data set. The Boosted Random Forest algorithm was found to give accurate predictions even on imbalanced data sets. The data analyzed in this study showed that death rates were higher in urban areas than in rural areas of Pakistan. Death rates were also higher among male patients than among female patients. Most of the affected patients were between 20 and 70 years of age. Future work will be directed towards creating a pipeline that combines computer vision models for CXR scanning with models that process healthcare and demographic data. These models will later be integrated with applications to support the growth of mobile healthcare. This is a step towards a semi-autonomous diagnostic system that can provide rapid screening and detection for regions affected by COVID-19, so that we are well prepared for future outbreaks.