"The long road is waiting for you..." or solving a forecasting problem in C# with ML.NET (DataScience)

Lately, mentions of the ML.NET machine learning framework have been catching my eye more and more often. Quantity finally turned into quality, and I decided to take at least a quick peek at what kind of beast it is.

Earlier, we already tried to solve a very simple prediction problem with linear regression in the .NET ecosystem, using the Accord.NET Framework. For that purpose, a small data set was prepared from open data on citizens' appeals to the executive authorities and personally to the Mayor of Moscow.

A couple of years later, with an updated data set, we will try to solve a very simple problem. Using a regression model in ML.NET, we will predict how many appeals per month receive a positive decision. Along the way, we will compare ML.NET with Accord.NET and the Python libraries.

Want to feel the strength and power of a predictor? Then welcome below the cut.

PS Let S.S. Sobyanin, the article will not say a word about politics.

Content:

Part I: introduction and a bit about data

Part II: writing C# code

Part III: Conclusion

I should warn you right away that I am not a pro in data analysis or in programming in general, and I am not affiliated with the Moscow city administration. So this is more of an article from one beginner to another. Despite my limited knowledge, I hope you will find it useful.

Those who are familiar with the previous articles in the series may recall that we have already tried to predict the number of positively resolved issues among citizens' appeals to the executive authorities of Moscow. For that we used Python and the Accord.NET Framework.

Other articles in the series

1. Learning the basics

2. Practicing the first skills

In any case, it will not hurt to go over the data set once more.

All the materials for the article, including the code and the data set, are freely available on GitHub.

The data on GitHub is in csv format and contains 44 records; in principle, it can (and should) be used for more than just this example.

Data columns mean the following:

num - record index
year - year of the record
month - month of the record
total_appeals - total number of appeals per month
appeals_to_mayor - total number of appeals to the Mayor
res_positive - number of positive decisions
res_explained - number of appeals answered with an explanation
res_negative - number of appeals with a negative decision
El_form_to_mayor - number of appeals to the Mayor in electronic form
Pap_form_to_mayor - number of appeals to the Mayor on paper
to_10K_total_VAO ... to_10K_total_YUZAO - number of appeals per 10,000 population in the various districts of Moscow
to_10K_mayor_VAO ... to_10K_mayor_YUZAO - number of appeals to the Mayor and the Government of Moscow per 10,000 population in the various districts of the city

I did not find a way to automate the data collection and gathered the data by hand, so I may have made small mistakes. Beyond that, the reliability of the data is on the conscience of its authors.

At the moment, the Moscow government website has complete data from January 2016 through August 2019 (some data for September is missing). That gives us 44 records. Not many, of course, but enough for a demonstration.
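The count of 44 records follows from simple arithmetic on the date range. A tiny sketch of that calculation (the MonthsInclusive helper is made up for this illustration and is not part of the project):

```csharp
using System;

public static class MonthCountDemo
{
    // Number of calendar months from (startYear, startMonth)
    // to (endYear, endMonth), both ends included.
    public static int MonthsInclusive(int startYear, int startMonth, int endYear, int endMonth)
    {
        return (endYear - startYear) * 12 + (endMonth - startMonth) + 1;
    }

    public static void Main()
    {
        // January 2016 through August 2019: 3 * 12 + 7 + 1 = 44
        Console.WriteLine(MonthsInclusive(2016, 1, 2019, 8)); // prints 44
    }
}
```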

Before you begin, just a few words about the hero of our article.

ML.NET is an open source framework developed by Microsoft. Judging by the advertising on social media, it is their answer to the Python machine learning libraries. The framework is cross-platform and covers a wide range of tasks, from simple regression and classification to deep learning. On Habr, a comparison of ML.NET and the Python libraries has already been published. If you are interested, here is the link.

I will not give a detailed guide to installing and using ML.NET because, in essence, everything here was ~~ripped off~~ "adapted" from a tutorial on the official Microsoft website. That tutorial solves a taxi fare prediction problem and, to be honest, it is the more useful of the two.

But I think a few short explanations will not hurt.

I used Visual Studio 2017 with the latest updates.

The project was based on the .NET Core console application template (version 2.1).

I had to install the NuGet packages Microsoft.ML and Microsoft.ML.FastTree into the project. That, in fact, is all the preparation.

Let's go directly to the code.

To start, I created the MayorAppel class, in which I described, in order, the columns of the csv file.

As you can easily guess, the [LoadColumn(0)] attribute tells us which column of the csv file to take.

Next, following the tutorial, I created the MayorAppelPrediction class for the prediction results.

Although almost all the columns in the data set hold integer values, I had to declare them as float so that all the features have the same type; otherwise an error occurs when the columns are concatenated in the pipeline.

The listing is fairly long, so I put it under a spoiler.

Class code for data description

using Microsoft.ML.Data;

namespace app_to_mayor_mlnet
{
    // One row of the csv file; LoadColumn(n) is the zero-based column index.
    class MayorAppel
    {
        [LoadColumn(0)] public float Year;
        [LoadColumn(1)] public string Month;
        [LoadColumn(2)] public float TotalAppeals;
        [LoadColumn(3)] public float AppealsToMayor;
        [LoadColumn(4)] public float ResPositive;
        [LoadColumn(5)] public float ResExplained;
        [LoadColumn(6)] public float ResNegative;
        [LoadColumn(7)] public float ElFormToMayor;
        [LoadColumn(8)] public float PapFormToMayor;
        [LoadColumn(9)] public float To10KTotalVAO;
        [LoadColumn(10)] public float To10KMayorVAO;
        [LoadColumn(11)] public float To10KTotalZAO;
        [LoadColumn(12)] public float To10KMayorZAO;
        [LoadColumn(13)] public float To10KTotalZelAO;
        [LoadColumn(14)] public float To10KMayorZelAO;
        // NOTE: index 6 repeats ResNegative above and looks like a typo;
        // check this and the following indices against the csv header.
        [LoadColumn(6)] public float To10KTotalSAO;
        [LoadColumn(15)] public float To10KMayorSAO;
        [LoadColumn(16)] public float To10KTotalSVAO;
        [LoadColumn(17)] public float To10KMayorSVAO;
        [LoadColumn(18)] public float To10KTotalSZAO;
        [LoadColumn(19)] public float To10KMayorSZAO;
        [LoadColumn(20)] public float To10KTotalTiNAO;
        [LoadColumn(21)] public float To10KMayorTiNAO;
        [LoadColumn(22)] public float To10KTotalCAO;
        [LoadColumn(23)] public float To10KMayorCAO;
        [LoadColumn(24)] public float To10KTotalYUAO;
        [LoadColumn(25)] public float To10KMayorYUAO;
        [LoadColumn(26)] public float To10KTotalYUVAO;
        [LoadColumn(27)] public float To10KMayorYUVAO;
        [LoadColumn(28)] public float To10KTotalYUZAO;
        [LoadColumn(29)] public float To10KMayorYUZAO;
    }

    // Prediction result; the regression trainer writes its output to the "Score" column.
    public class MayorAppelPrediction
    {
        [ColumnName("Score")] public float ResPositive;
    }
}

Let's move on to the main program code.

Do not forget to add at the very beginning:

using System; // needed for Environment and Console
using System.IO;
using Microsoft.ML;

The following is a description of the data fields.

namespace app_to_mayor_mlnet
{
    class Program
    {
        // Paths to the training data, the test data and the model file
        // (the model path is declared but not used in the code shown here).
        static readonly string _trainDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "train_data.csv");
        static readonly string _testDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "test_data.csv");
        static readonly string _modelPath = Path.Combine(Environment.CurrentDirectory, "Data", "Model.zip");

These fields simply store the paths to the data files. This time I decided to split the data into training and test files in advance (unlike in the Accord.NET article).

By the way, if you build your own project, do not forget to set the “Copy if newer” option in the properties of the data files; otherwise the program will fail because the files are missing from the build output directory.

Next come the calls to the methods that build the model, evaluate it and give us a prediction.
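The article does not show how train_data.csv and test_data.csv were produced from the full data set. Below is a minimal sketch of one way to do it, assuming the rows are in chronological order and the first line is a header. The CsvSplitter class, the SplitLastRows helper and the sample rows are illustrative and not part of the original project.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CsvSplitter
{
    // Splits csv lines (header first) into a training part and a test part.
    // The last `testRows` data rows go to the test part; the header is repeated in both.
    public static (List<string> Train, List<string> Test) SplitLastRows(
        IReadOnlyList<string> lines, int testRows)
    {
        if (lines.Count < 1) throw new ArgumentException("Header line is missing.");
        int dataRows = lines.Count - 1;
        if (testRows < 0 || testRows > dataRows)
            throw new ArgumentOutOfRangeException(nameof(testRows));

        string header = lines[0];
        var data = lines.Skip(1).ToList();

        var train = new List<string> { header };
        train.AddRange(data.Take(dataRows - testRows));

        var test = new List<string> { header };
        test.AddRange(data.Skip(dataRows - testRows));

        return (train, test);
    }

    public static void Main()
    {
        // Made-up sample rows just to show the call. Real data would be read with
        // File.ReadAllLines(...) and written back with File.WriteAllLines(...).
        var lines = new[] { "year,month,value", "2019,May,1", "2019,June,2", "2019,July,3" };
        var (train, test) = SplitLastRows(lines, testRows: 1);
        Console.WriteLine($"train rows: {train.Count - 1}, test rows: {test.Count - 1}");
    }
}
```

Holding out the most recent rows, rather than a random sample, keeps the test set strictly later in time than the training set, which fits a forecasting task.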

        static void Main(string[] args)
        {
            MLContext mlContext = new MLContext(seed: 0);
            var model = Train(mlContext, _trainDataPath);
            Evaluate(mlContext, model);
            TestSinglePrediction(mlContext, model);
        }

Let's go in order.

The Train method is needed to train the model.

        public static ITransformer Train(MLContext mlContext, string dataPath)
        {
            // Load the training data; the first line of the csv file is a header.
            IDataView dataView = mlContext.Data.LoadFromTextFile<MayorAppel>(dataPath, hasHeader: true, separatorChar: ',');

            // Label = ResPositive; Month is one-hot encoded; all features go into one "Features" vector.
            var pipeline = mlContext.Transforms.CopyColumns(outputColumnName: "Label", inputColumnName: "ResPositive")
                .Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: "MonthEncoded", inputColumnName: "Month"))
                .Append(mlContext.Transforms.Concatenate("Features",
                    "Year", "MonthEncoded", "TotalAppeals", "AppealsToMayor",
                    "ResExplained", "ResNegative", "ElFormToMayor", "PapFormToMayor",
                    "To10KTotalVAO", "To10KMayorVAO", "To10KTotalZAO", "To10KMayorZAO",
                    "To10KTotalZelAO", "To10KMayorZelAO", "To10KTotalSAO", "To10KMayorSAO",
                    "To10KTotalSVAO", "To10KMayorSVAO", "To10KTotalSZAO", "To10KMayorSZAO",
                    "To10KTotalTiNAO", "To10KMayorTiNAO", "To10KTotalCAO", "To10KMayorCAO",
                    "To10KTotalYUAO", "To10KMayorYUAO", "To10KTotalYUVAO", "To10KMayorYUVAO",
                    "To10KTotalYUZAO", "To10KMayorYUZAO"))
                .Append(mlContext.Regression.Trainers.FastTree());

            // Fit the whole pipeline on the training data.
            var model = pipeline.Fit(dataView);
            return model;
        }

First, we read the data from the training sample. Then, in the pipeline, we specify the value to be predicted (the label).

In our case, this is the number of successfully resolved issues among citizens' appeals per month.

Since a regression model based on boosted decision trees is used here, all the features have to be converted to numeric values.

Unlike with Accord.NET, a ready-made OneHotEncoding solution is offered right in the documentation.

After that, it remains to combine the feature columns; as I said above, they must all have the same data type, in this case float.

Finally, we fit the pipeline and return the trained model.
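To make it clearer what one-hot encoding does with the Month column, here is a minimal hand-written sketch in plain C#. It only illustrates the idea and is not how ML.NET implements it: the OneHotDemo class, the OneHot helper and the fixed category list are assumptions, whereas ML.NET builds its dictionary of categories from the training data.

```csharp
using System;
using System.Collections.Generic;

public static class OneHotDemo
{
    // Returns a vector with 1 at the position of `value` in `categories` and 0 elsewhere.
    // A value that is not in the list produces an all-zero vector.
    public static float[] OneHot(IReadOnlyList<string> categories, string value)
    {
        var vector = new float[categories.Count];
        for (int i = 0; i < categories.Count; i++)
        {
            if (categories[i] == value) vector[i] = 1f;
        }
        return vector;
    }

    public static void Main()
    {
        var months = new[] { "June", "July", "August" }; // shortened list for the demo
        Console.WriteLine(string.Join(" ", OneHot(months, "July"))); // prints "0 1 0"
    }
}
```

With all twelve months, each month name turns into twelve 0/1 features, so the tree model never sees the text itself.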

Next, we evaluate the quality of the prediction by our model.

        private static void Evaluate(MLContext mlContext, ITransformer model)
        {
            // Load the test data and run it through the trained model.
            IDataView dataView = mlContext.Data.LoadFromTextFile<MayorAppel>(_testDataPath, hasHeader: true, separatorChar: ',');
            var predictions = model.Transform(dataView);

            // Compare the "Score" column with the "Label" column.
            var metrics = mlContext.Regression.Evaluate(predictions, "Label", "Score");

            Console.WriteLine();
            Console.WriteLine($"*************************************************");
            Console.WriteLine($"* Model quality metrics evaluation ");
            Console.WriteLine($"*------------------------------------------------");
            Console.WriteLine($"* RSquared Score: {metrics.RSquared:0.##}");
            Console.WriteLine($"* Root Mean Squared Error: {metrics.RootMeanSquaredError:#.##}");
        }

We load the test sample (the last 4 months of the data set) and obtain predictions for it from the trained model using the Transform() method. Then we compute the metrics and print them. In this case these are the coefficient of determination and the root mean squared error. Ideally, the first should tend to 1 and the second to 0.

Strictly speaking, we do not need this method to make a prediction, but it is nice to know how badly our model predicts.

The last method remains: the prediction itself.

We will hide it under a spoiler as well.
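For reference, both metrics are easy to compute by hand. The sketch below uses the textbook definitions, R^2 = 1 - SS_res / SS_tot and RMSE = sqrt(mean((y - y_pred)^2)). The class, the method names and the sample numbers are made up for illustration; for the real model, the values reported by ML.NET's Evaluate method remain the source of truth.

```csharp
using System;
using System.Linq;

public static class RegressionMetricsDemo
{
    // Root mean squared error: square root of the average squared residual.
    public static double Rmse(double[] actual, double[] predicted)
    {
        CheckLengths(actual, predicted);
        double sumSq = actual.Zip(predicted, (y, p) => (y - p) * (y - p)).Sum();
        return Math.Sqrt(sumSq / actual.Length);
    }

    // Coefficient of determination: 1 - SS_res / SS_tot.
    // Assumes the actual values are not all identical (SS_tot > 0).
    public static double RSquared(double[] actual, double[] predicted)
    {
        CheckLengths(actual, predicted);
        double mean = actual.Average();
        double ssRes = actual.Zip(predicted, (y, p) => (y - p) * (y - p)).Sum();
        double ssTot = actual.Sum(y => (y - mean) * (y - mean));
        return 1.0 - ssRes / ssTot;
    }

    private static void CheckLengths(double[] a, double[] b)
    {
        if (a.Length == 0 || a.Length != b.Length)
            throw new ArgumentException("Arrays must be non-empty and of equal length.");
    }

    public static void Main()
    {
        // Made-up values, not taken from the article's data set.
        double[] actual = { 1, 2, 3, 4 };
        double[] predicted = { 1, 2, 3, 6 };
        Console.WriteLine($"RMSE = {Rmse(actual, predicted):0.##}");     // RMSE is 1 for this data
        Console.WriteLine($"R^2  = {RSquared(actual, predicted):0.##}"); // R^2 is 0.2 up to rounding
    }
}
```

Note that R^2 can be negative on a test set: that happens when the model predicts worse than simply returning the mean of the actual values.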

prediction method and data

        private static void TestSinglePrediction(MLContext mlContext, ITransformer model)
        {
            var predictionFunction = mlContext.Model.CreatePredictionEngine<MayorAppel, MayorAppelPrediction>(model);

            // Minimal set of features: only the year and the month.
            var MayorAppelSampleMinData = new MayorAppel()
            {
                Year = 2019, Month = "August",
                ResPositive = 0
            };

            // Medium set: the monthly totals without the per-district data.
            var MayorAppelSampleMediumData = new MayorAppel()
            {
                Year = 2019, Month = "August",
                TotalAppeals = 111340, AppealsToMayor = 17932,
                ResExplained = 66858, ResNegative = 8945,
                ElFormToMayor = 14931, PapFormToMayor = 2967,
                ResPositive = 0
            };

            // Full set: everything, including appeals per 10,000 residents by district.
            var MayorAppelSampleMaxData = new MayorAppel()
            {
                Year = 2019, Month = "August",
                TotalAppeals = 111340, AppealsToMayor = 17932,
                ResExplained = 66858, ResNegative = 8945,
                ElFormToMayor = 14931, PapFormToMayor = 2967,
                To10KTotalVAO = 67, To10KMayorVAO = 13,
                To10KTotalZAO = 57, To10KMayorZAO = 13,
                To10KTotalZelAO = 49, To10KMayorZelAO = 9,
                To10KTotalSAO = 71, To10KMayorSAO = 14,
                To10KTotalSVAO = 86, To10KMayorSVAO = 27,
                To10KTotalSZAO = 68, To10KMayorSZAO = 12,
                To10KTotalTiNAO = 93, To10KMayorTiNAO = 36,
                To10KTotalCAO = 104, To10KMayorCAO = 24,
                To10KTotalYUAO = 56, To10KMayorYUAO = 12,
                To10KTotalYUVAO = 59, To10KMayorYUVAO = 13,
                To10KTotalYUZAO = 78, To10KMayorYUZAO = 23,
                ResPositive = 0
            };

            var predictionMin = predictionFunction.Predict(MayorAppelSampleMinData);
            var predictionMed = predictionFunction.Predict(MayorAppelSampleMediumData);
            var predictionMax = predictionFunction.Predict(MayorAppelSampleMaxData);

            Console.WriteLine($"**********************************************************************");
            Console.WriteLine($"Prediction for August 2019");
            Console.WriteLine($"Predicted Positive decisions (Minimum Features): {predictionMin.ResPositive:0.####}, actual res_positive : 22313");
            Console.WriteLine($"Predicted Positive decisions (Medium Features): {predictionMed.ResPositive:0.####}, actual res_positive : 22313");
            Console.WriteLine($"Predicted Positive decisions (Maximum Features): {predictionMax.ResPositive:0.####}, actual res_positive : 22313");
            Console.WriteLine($"**********************************************************************");
        }
    } // end of class Program
} // end of namespace app_to_mayor_mlnet

In this example, we use the PredictionEngine class, which makes a single prediction from the trained model and one input record.

We create three "samples" with data for prediction.

The first has a minimal set of data (only the month and the year), the second a medium set and the third the full set of features.

We get three different predictions and print them.

As you can see in the screenshot (Windows 10 x64), adding the data on the number of appeals per 10,000 residents by district only makes things worse in this case, while adding the rest of the data slightly improves the accuracy of the prediction.

Under Linux Mint 19, everything also compiles just fine with Mono.

So the framework really is cross-platform.

In conclusion, as promised, here is a short, subjective comparison of ML.NET with Accord.NET and the Python machine learning libraries.

1. You can feel that the developers are trying to keep up with trends in machine learning. Of course, in Python with the set of libraries bundled with Anaconda, this task could be solved more compactly and with less development time. But overall, it seems to me that ML.NET's approach is friendly to people who are used to solving machine learning problems in Python.

2. Compared with the Accord.NET Framework, ML.NET looks more convenient and promising to someone who has tried machine learning in Python. I remember that when I tried to write something with Accord.NET two years ago, I sorely missed explanations and examples for some classes and methods. In this respect, ML.NET is slightly better documented, even though the framework is much younger than Accord.NET. Another important factor is that ML.NET, judging by the activity on GitHub, is being developed much more actively than Accord.NET and has more Russian-language learning materials.

All in all, at first glance ML.NET looks like a convenient tool to add to your arsenal when you cannot use Python or R (for example, when working with CAD APIs that run on .NET).

Have a good working week!