Machine Learning fundamentals – PART II
Microsoft keeps announcing new features and new products these days, and Azure in particular is seeing a huge number of changes and improvements. One of these improvements concerns Machine Learning. I know this isn't something new in itself; what is new is that anyone can do it "easily" using a drag & drop model studio in the cloud (Azure).
I already wrote an introduction to Machine Learning in one of my previous articles HERE, but it is so impressive that I think another article is definitely in order. First, it is worth repeating the basic ideas of Microsoft's machine learning. The main point is that Microsoft really simplifies the whole process. Previously, a lot of time and money had to be spent on configuring basic tasks and on preparing training sets to teach the system in the first steps, and on top of that there had to be enough computing resources to make things happen. It also needs to be said that Microsoft is not the first company to offer the Machine Learning concept; Google was there earlier with its API. But as usual, the offering that is easiest from the usage, design and marketing points of view wins the race, and voilà. The next great feature of this whole Azure concept is that Microsoft combines Machine Learning with its most widely used products of all time, SQL and Excel. Now, with Power BI, it finally looks like a complete scenario and not just separate parts with no connecting hub.
Picture 1 - basic window with drag and drop tutorial
One more thing needs to be said. The basic ML Studio, which is available to everyone through the Azure portal, is really helpful for learning and for discovering the full potential of Machine Learning concepts. As shown in Picture 2, the toolset is on the left side, where you can choose data formats, inputs, outputs, operations and so on, and on the right side are the properties of whichever part you have dropped onto the modeling canvas in the center of the studio. In my case, a sample analysis model is shown. And that is another great thing: the main page of Machine Learning Studio offers many sample diagrams/models and videos.
Now a little bit of practical fundamentals. Let's say that we would like to build a network intrusion prevention model. There is a sample experiment for this scenario, as well as many other sample models. If I select this one, the experiment simply opens and I can edit all the settings of each individual process in the model. When I have finished editing, I click the RUN button at the bottom of the screen, which starts the experiment. When it completes, I can visualize the results by right-clicking the Evaluate Model process and then clicking Visualize. The next screens show this experiment and the visualization of its results.

If you would like to build your own experiment, use the search bar in the upper left and search for whatever you need. There are many saved datasets, a data reader (for reading data from an external source), data transformations, machine learning procedures, and modules where you can enter your own R code. For me the most important modules/procedures are the statistical ones, because in combination with the others you can build a self-learning automated process for almost anything (in my case, statistical analysis with predictive functions).
Let's begin with our own model. In the ML Studio home section, select NEW and then a blank model.
It is important to say that the next steps were replicated from Microsoft's official guides.
- First of all, select a data source.
If you right-click the small circle on the shape, you can select Visualize in the menu and look at what data the dataset contains.
- Second, I will add some data transformations, for example Missing Values Scrubber and Project Columns. As you can see, the Project Columns shape indicates that something is wrong. What? If you click on this shape, the properties on the right side (the shape menu) let you launch the column selector and choose the columns to project. After you select the columns, the warning disappears.
- Then we need to split the dataset by rows into two parts. Add the Split procedure from the Data Transformation\Sample and Split section of the menu on the left and set “Fraction of rows in the first output dataset” to 0.8. Then add another Split procedure and this time set the same parameter to 0.75.
- After that we need to add some machine learning procedures so that we can make predictions. Add “Two-Class Support Vector Machine” from the Machine Learning\Initialize Model\Classification section and “Train Model” from the Machine Learning\Train section, then connect the last Split procedure to Train Model and connect Two-Class Support Vector Machine to Train Model as well. If you look at the Train Model shape, there is another warning icon. You need to launch the column selector, and because this exercise works with financial data, we will select the income column.
- The last thing to add is Score Model. We need to know how well the model is doing, so select Score Model from the Machine Learning section, add the shape to the diagram, and connect it both to the Split shape and to the Train Model shape.
- If we then click the PLAY button, wait for the results, click the Score Model shape and choose Visualize, we should see something like this:
As you can see, there are 2 new columns named “Scored Labels” and “Scored Probabilities”. So something is happening there. OK, let's move on.
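For readers who think better in code, the data preparation above (scrub missing values, project columns, split twice) can be sketched in a few lines of plain Python. This is only my illustration, not what ML Studio runs internally: the rows and the helper names `clean_missing`, `project_columns` and `split_rows` are made up, dropping incomplete rows is just one possible cleaning mode, and I assume that the second split takes the first output of the first split.

```python
import random

def clean_missing(rows, columns):
    """Keep only rows that have a value in every listed column."""
    return [r for r in rows if all(r.get(c) is not None for c in columns)]

def project_columns(rows, columns):
    """Keep only the listed columns in every row."""
    return [{c: r[c] for c in columns} for r in rows]

def split_rows(rows, fraction, seed=0):
    """Shuffle the rows and return (first part, remainder)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical dataset: 105 rows, 5 of them with a missing age.
rows = [
    {"id": i,
     "age": None if i % 21 == 0 else 20 + i % 40,
     "hours": 30 + i % 20,
     "income": ">50K" if i % 3 == 0 else "<=50K"}
    for i in range(105)
]

columns = ["age", "hours", "income"]
prepared = project_columns(clean_missing(rows, columns), columns)

rest, part_c = split_rows(prepared, 0.8)   # first split: 80% / 20%
part_a, part_b = split_rows(rest, 0.75)    # second split of the 80%: 60% / 20%

print(len(prepared), len(part_a), len(part_b), len(part_c))  # prints: 100 60 20 20
```

Under that assumption, 0.8 followed by 0.75 leaves 60% of the rows in the first part (0.8 × 0.75 = 0.6) and 20% in each of the other two.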
- Now add Evaluate Model from the Machine Learning\Evaluate section, connect it to Score Model and click PLAY again.
- Let's compare this model with another one. We can add a second algorithm to set against the Two-Class Support Vector Machine. Let's select “Two-Class Logistic Regression” from the Machine Learning\Initialize Model\Classification section.
- Because we need a comparison, let's copy two parts from the previous steps. Select Train Model and Score Model, press CTRL+C and CTRL+V, and delete the copied Train Model's input connection from Two-Class Support Vector Machine. Then connect the new Two-Class Logistic Regression to the copied Train Model in its place, connect the copied Score Model to Evaluate Model, and finally click the PLAY button again.
After visualizing Evaluate Model, the graph and table should look like this:
A good example of the difference is the AUC value, which was 0.887 in the first model/scenario and 0.899 in the second. To see it, click “second dataset to compare” next to the graph in the visualization.
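If you wonder what the AUC number actually measures: it is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as half. Below is a minimal sketch in plain Python, using made-up labels and scores for two hypothetical models; these are not the values from my experiment, and the function name `auc` is my own.

```python
def auc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # a tie counts as half
    return wins / (len(pos) * len(neg))

# Made-up test labels and the scored probabilities of two hypothetical models.
labels  = [1, 1, 0, 1, 0, 0]
model_a = [0.9, 0.6, 0.7, 0.8, 0.3, 0.2]
model_b = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

auc_a = auc(labels, model_a)
auc_b = auc(labels, model_b)
print(round(auc_a, 3), round(auc_b, 3))  # prints: 0.889 1.0
```

The model with the higher AUC ranks positives above negatives more often, which is why I use it to compare the scenarios.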
- Let's make another comparison. Add another classifier named “Two-Class Boosted Decision Tree”, copy the Train Model and Score Model parts again, and change the Train Model connection so that it comes from Two-Class Boosted Decision Tree instead of Two-Class Logistic Regression. Then add another Evaluate Model part. Everything should look like the next screen. Then click PLAY again and select Visualize on the second Evaluate Model part.
In the visualization we can see that the AUC value is a little higher, this time 0.925.
- Let's find the optimum parameters with the sweep method. Select the Sweep Parameters part from the Machine Learning\Train section. Connect “Two-Class Boosted Decision Tree” and the two connectors from the first Split to it. Next, columns need to be selected in Sweep Parameters, so click this newly added part, launch the column selector in the properties window on the right side, select the income column and click OK.
- Then choose the random sweep mode, set the maximum number of runs on random sweep to 10 and select AUC as the metric for measuring performance.
After this, select the Sweep Parameters part and copy it twice. The first copy should be connected to the first Split and Two-Class Support Vector Machine, and the second copy to the first Split and Two-Class Logistic Regression. The third one is the original, already wired up in the previous steps. The final screen should look like this:
Now if you click the PLAY button again, the model should update, and now comes the important part :) We need to select the optimum, so click the last Sweep Parameters part (the rightmost one) and visualize it. We concentrate on the AUC column and look for the highest value. The row that begins with the number 9 is our candidate, so let's copy all the values from this row somewhere.
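Conceptually, a random sweep does something like the following sketch: draw random parameter combinations, score each one, and keep the best. Everything here is assumed for illustration. The parameter names and ranges are hypothetical, and `toy_auc` is only a stand-in for "train the model with these parameters and measure AUC", so the numbers it produces mean nothing.

```python
import random

def sample_params(rng):
    """Draw one random combination from hypothetical parameter ranges."""
    return {
        "num_leaves": rng.randint(2, 128),
        "learning_rate": rng.uniform(0.025, 0.4),
        "num_trees": rng.randint(20, 500),
    }

def random_sweep(evaluate, runs=10, seed=0):
    """Evaluate `runs` random combinations; return the best one and all results."""
    rng = random.Random(seed)
    results = []
    for run in range(runs):
        params = sample_params(rng)
        results.append((evaluate(params), run, params))
    best = max(results, key=lambda r: r[0])   # highest metric wins
    return best, results

def toy_auc(params):
    """Stand-in metric: NOT a real model, just a deterministic fake score."""
    return 0.9 - 0.5 * abs(params["learning_rate"] - 0.2)

(best_score, best_run, best_params), results = random_sweep(toy_auc, runs=10)

for score, run, params in results:
    print(run, round(score, 3), params)
print("best run:", best_run, "AUC:", round(best_score, 3))
```

Picking the row with the highest AUC in the visualization is the manual version of the `max(...)` line.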
Now let's use these values. Add another Two-Class Boosted Decision Tree and fill in its properties with these values.
Add a Train Model, connect it to this new decision tree, and also connect it to the first Split (the Split right after Project Columns). In Train Model, select the income column. Add a Score Model, connect it to Train Model, and also connect it to the first Split (the one after Project Columns). Then add an Evaluate Model and connect it to Score Model. Click RUN.
If you now click Visualize on this Evaluate Model (the rightmost and newest one), you can see that the final test AUC value is 0.915.
Last, we can also compare the values from this last Score Model with those from the Two-Class Boosted Decision Tree part of our model. Add a Score Model under the Sweep Parameters of the Two-Class Boosted Decision Tree part, connect Sweep Parameters and the first Split to it, and finally connect both of these Score Models to an Evaluate Model for comparison, as on the next screen.
Now if we click RUN again, the model will finish, and the final visualization of our comparison shows that there is no difference between the last two calculations. In other words, we were successful.
Thank you for reading part II. I hope that you will read part III as well.