#20220090 Classification of the Mushroom data set and Ecoli protein data set





PROJE KODU20220090
PROJE SAHİBİNeriman Dilara Özcan
PROJE SAHİBİ LINKEDIN https://www.linkedin.com/in/neriman-dilara-%C3%B6zcan-246b68222/
PROJE MALİYETİThe project I have worked on did not need any financial support. I have basically just wrote my codes in a free node book.
PROJE ÜNİVERSİTESİAnkara Üniversitesi
PROJE KATEGORİSİToplumsal, Sosyal Medya ve Diğer
PROJE DANIŞMANIDr Öğr. Üyesi Yılmaz Ar



Classification problems are getting popular day by day. Especially for up-to-date problems like the Afghan family whose five years old son died after eating poisonous mushrooms. Before starting with the project, me and my supervisor have discussed how the program and the aim of the project could look like. We have decided to not only write a normal machine learning program, we also want to play around with the data set by adding data or deleting data from the data set to analyse the accuracy value given by the classification models. While developing the project, I have realized that the Mushroom data set is pretty balanced, so that we have decided to also use the E coli data set which is an imbalanced data set. In that way, I got he chance to compare these two data sets with each other and also to show two different approaches, because the mushroom data set is a binary classification and the E coli a multi class classification. This enabled me to create a project that analyses both binary and multi class classification. The programs were written in Python. The general structure of my program is divided into sub task. First, there is the need of having sufficient and historical data. Secondly, the data needs to be pre-processed so that this can be used for the machine learning process. The raw data cannot be used directly unfortunately. Next step is to choose a model. That means a decision should be made according to which algorithms need to be used to get an output with the highest accuracy. After that, the model needs to be trained. Before training the model, it is like a blank model. In other words, after training it, it becomes a trained model. Moreover, once we have trained the model, it needs to be tested to make sure that it is predicting correctly as well. After testing the model, the probability of getting a result with a low accuracy is very high at the beginning. For solving this, the model needs to be tuned up to get a higher accuracy. Afterwards, the program will be ready to do predictions. This is how my program is divided into sub tasks to predict a label from the data set that I have used.

My Bachelor Project is a machine learning project for classifying mushrooms as edible or poisonous and for classifying E coli protein as well. The projects content also includes analysis of the accuracy values depending on the classification models that were selected by me.

Not long ago, there was a new article about an Afghan family whose 5 years old son died after eating mushrooms which were poisonous. This is actually something that happens very often. Mushrooms can be distinguished between a variety of shapes, sizes, and colours etc. Being able to detect whether a given mushroom is poisonous or edible is something crucial to survive, so that developing a program to classify a mushroom as poisonous, or edible is something important for people, who live in rural areas or have traditional food including mushrooms like the Afghan family. Starting to help people to detect which mushrooms can be eaten, a classification problem of mushrooms as poisonous or edible could be very useful in that case. There are some programs developed by people, which are also published on the internet, that are to be find in YouTube or google as well. The point while developing a classification program like this, is to achieve a high accuracy as possible. That means, if a random set of the given dataset is chosen, the output should have a high accuracy so that we can say that the prediction of the trained model is true. I am setting the hypothesis, that my project is innovative, because me and my supervisor have decided to not only concentrate to just train and predict things, we are also analysing the output and comparing the models with each other. In addition to that, we have selected two different data sets, which are for binary and multi class classification. Because of the fact, that the mushroom data set is a pretty balanced data set, we have decided to use an imbalanced data set, which is the E coli data set. In that way, we got the chance to play around with the data set and to analyse the accuracy values that were given us. This made my project a research project that analyses how to work on a machine learning project, but also how one data set member can influence the accuracy value at the end. So, for people who would work on a machine learning project for the first time like me, this research project could be very useful for the general understanding.

The Project I have worked on is a machine learning project, that actually just needs a PC at the beginning. How to expand the usage of the developed program could be different. Together with other programs and codes it could be put into a device or just expanded on the pc by adding other functionalities like artificial intelligence for responding verbally to the user or by including a camara etc. It is actually very easy to work with this project because it is mainly giving as just numerical percentage values. We just need to be able to evaluate these percentage values in the problem context. By adding other functionalities, I think this project could be very useful in the society especially the mushroom problem.

- Retrieved on 26.10.2021 at 13.40 from google for searching a classification problem of mushrooms: https://milindsoorya.site/blog/mushroom-dataset-analysis-andclassification-python - Retrieved on 13.10.2021 at 14.15 from google for searching for the Mushroom dataset: UCI Mushroom Data | Kaggle - Retrieved on 13.10.2021 at 14:35 from google for searching for the Mushroom dataset: UCI Machine Learning Repository: Mushroom Data Set - Retrieved on 02.09.2021 at 21:20 from google for searching for the article where the five years old boy died because of eating mushrooms which were poisonous: https://www.rtl.de/cms/nach-flucht-aus-afghanistan-5-jaehriger-stirbtan-pilzvergiftung-im-polnischen-auffanglager-4824484.htm

none

I have tried to both write my codes understandable and also my reports understandable as well, so that they can be useful for people who need inspirations in developing machine learning projects. As I already mentioned at the other parts, eating poisonous mushrooms is still crucial for people who live in rural areas for example. A program like this could be helpful to prevent such a death. In addition to that, my project is including both binary and multi class classification, so that people read explanations and also understand what to do in such a machine learning project. The project I have developed could be easily expanded so that it can be expanded with artificial intelligence, or the program could be put into a device that is giving verbal feedback or there also could be developed a program on pc that is giving verbal response as well. These kinds of things could be added to this project for making it more useful in the society.

- Retrieved on 13.10.2021 at 14.15 from google for searching for the Mushroom dataset: UCI Mushroom Data | Kaggle - Retrieved on 13.10.2021 at 14:35 from google for searching for the Mushroom dataset: UCI Machine Learning Repository: Mushroom Data Set - Retrieved on 24.04.2022 at 8:00 from google for searching for the E coli dataset: https://machinelearningmastery.com/imbalanced-multiclass-classification-with-the-e-coli-dataset/