Strong “caffe” for breakfast and off-site hackathons: why they matter for growing a Data Science community

I am a Data Scientist on the Data Lake Platform team at Raiffeisenbank. Three years ago, the bank had no Big Data practice; now we have a dedicated platform for working with big data and an actively growing community. As the data-driven culture develops, we face many questions: technical, communication-related, and more.

In this article I want to describe how our community, Raiffeisen Data University, helps to solve some of them.

Scalability issues

A couple of years ago, all Data Scientists worked in isolation, each on their own tasks, and no one thought about a community. Meanwhile, the number of ideas requiring data analysis expertise kept growing, as did the number of divisions with Data Scientists on staff.

Various difficulties began to appear:

Communication among Data Scientists:
- nobody knows which business cases colleagues are currently working on;
- each team reinvents the wheel to implement the same functionality.

Technical side:
- the search for input data for modeling is opaque;
- the code cannot be rerun on new data;
- cluster resources are not used optimally;
- the process of deploying a model to production is not standardized.

Interaction with business customers:
- not all customers understand what machine learning can solve, what its limitations are, and how to formulate a task.

Where do you start with these issues and set out on the path to a mature data-driven company? Different strategies are possible: gather all Data Scientists in one large department, or add a Chief to every team and hire yet another chief Chief to set the direction of development. We decided to go another way.

So the idea of Raiffeisen Data University (RDU) was born. It is not a university in the usual sense; it is a flexible mechanism that helps Data Scientists solve their problems by organizing various activities. How does it manage that?

Everything ingenious is simple

First, we needed to introduce people from different business divisions to each other and get them in sync. The simplest thing that comes to mind is to hold a meetup.

The first one took place about two years ago; it brought together Data Scientists from different departments who until then had not known of each other's existence. Now meetups have become commonplace. We meet new colleagues there and share cases that are solved or still in progress. You can pitch your ideas to the speaker or ask tricky questions about metrics or data quality. Or you can run a hands-on workshop on the tools used in your project. A variety of specific topics come up: how CI/CD for a model is set up in production, the model architecture of a solved case, how the business formulated the problem and what made the solution difficult, and many others. Previously, everything took place in a closed room, where only those who had passed the initiation rite were admitted.

By now we have accumulated useful experience worth sharing. Internal meetups help us solve communication and technical difficulties. And together with the ML REPA project, we held our first open meetup for everyone.

Strong caffe for breakfast

Meetups require some preparation and happen roughly once every month or two. But something new and interesting happens all the time, so we also meet at Data Science breakfasts to keep communication going. The number of participants varies ~~depending on who woke up on time~~.

At breakfast, in addition to treats and the pleasure of talking with like-minded people, you pick up a lot of useful information about new libraries and algorithms, sort out a problem with your application architecture, or find out what resources will soon be added to the cluster. The benefit of such short meetings is sometimes no less than that of large meetups.

Learning rate improvement

“Even more benefit, even more knowledge!” - we wished out loud. So a competitive element appeared - gaps, as we call them. They were inspired by the machine learning trainings at Yandex, adapted to our own needs and capabilities. A competition on open data runs for approximately three weeks:

- in the first week we all meet and brainstorm possible approaches to the problem (very similar to DMIA sports workshops);
- in the second week there is an intermediate meeting: we discuss who is stuck on what and motivate each other to keep going;
- finally, there is a debriefing: the winners are announced and we discuss what worked and what did not.

Within one competition, we try to focus on one topic: dirty data, time series, text analysis. Everyone chooses either tools they have long wanted to try but never got around to, or whatever should bring the best result on the leaderboard. The coolest one was on reinforcement learning - you had to train your agent to interact with an Atari environment. To wrap up, the organizers arranged a battle between bots and people in three games - Pac-Man, Breakout, Space Invaders.

As a result, people won at Pac-Man by a wide margin; in the other two games, humanity lost to Skynet.

Discover the Data Scientist

Managers were not left out either. A one-day internal hackathon for everyone who is connected with analytics but has a poor understanding of how work with data is organized is a good opportunity to quickly dive into the inner workings of Data Science tasks. The day starts with an overview lecture on concepts, algorithms, and the most common metrics in classification and regression problems. After that, participants are given a real case to solve on our data. There are about 4 hours for the solution, so to keep things productive, one Data Scientist is assigned to help each team.

I was at one of these hackathons as the hands that implement the ideas proposed by managers and steer the discussion in a constructive direction. The task was to build a customer churn model based on six months of real data (the churn condition was specified) and to estimate the economic effect the model would bring. Everything went wrong for us along the way, and pieces of code broke from run to run - this let the team feel the full complexity of feature engineering. On the other hand, there were many ideas that a Data Scientist might not have thought of right away for lack of business experience.
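To make the "economic effect" part of the task more concrete, here is a minimal sketch of how such an estimate could be computed from a model's predictions. It is not the calculation used at the hackathon: the function name `retention_campaign_effect` and all the numbers (the value of a retained customer, the cost of an offer, the share of churners who accept it) are hypothetical assumptions chosen for illustration.

```python
def retention_campaign_effect(y_true, y_pred, customer_value, offer_cost, accept_rate):
    """Net effect of sending a retention offer to every customer the model flags.

    y_true: 1 if the customer actually churned, 0 otherwise
    y_pred: 1 if the model flagged the customer as a likely churner
    customer_value, offer_cost, accept_rate: illustrative business assumptions
    """
    # Only correctly flagged churners can be retained by the offer.
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    # Every flagged customer receives the offer, so every flag costs money.
    flagged = sum(y_pred)
    saved_revenue = true_positives * accept_rate * customer_value
    campaign_cost = flagged * offer_cost
    return saved_revenue - campaign_cost


# Toy data: 8 customers, 4 real churners, the model flags 4 customers.
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

effect = retention_campaign_effect(
    y_true, y_pred, customer_value=1000.0, offer_cost=50.0, accept_rate=0.5
)
print(effect)  # 3 caught churners * 0.5 * 1000 - 4 offers * 50 = 1300.0
```

Even a rough formula like this shows why the quality metric matters to the business: a false positive costs one offer, while a missed churner costs a whole customer.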

Thanks to such events, managers learn to estimate deadlines for DS tasks more objectively and learn about the pitfalls and the importance of the quality metric chosen at the start. And Data Scientists get to see the task through a manager's eyes and work out which points should be clarified right at the beginning of a collaboration.

The strongest will survive

But the most interesting thing usually happens in September, when the DS team leaves for a two-day hackathon in the countryside, in a very picturesque place with convenient infrastructure. The organizers invite experienced external mentors for the hackathon. Last year, Emeli Dral and Alexander Gushchin prepared a task: determine the genre of a film from a dialogue in it. The training set contained almost 40 thousand dialogues from 438 films across 20 genres, all of them films with English subtitles.

We listened to a brief introduction to NLP: text preprocessing methods, simple approaches, and more sophisticated ones using DL. We also talked separately about teamwork in ML projects - how to organize code and how that saves time. While the talks were still going on, the most active participants had already downloaded fastText and GloVe embeddings to their laptops.

After the lecture, a competition began in the Kaggle InClass format with public and private leaderboards. We split into teams with maximum shuffling, so that no team had two people from the same department. We had 24 hours for everything.
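For readers unfamiliar with the public/private leaderboard format, the idea can be sketched as follows: the hidden test labels are split once into two fixed parts, submissions are scored on the public part during the competition, and the final ranking uses the private part. The function names, the 50/50 split, and accuracy as the metric are assumptions made for illustration, not details of our hackathon.

```python
import random


def split_leaderboard(n_samples, public_share=0.5, seed=0):
    """Split test-set indices once into fixed public and private parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed: same split for every submission
    cut = int(n_samples * public_share)
    return sorted(indices[:cut]), sorted(indices[cut:])


def accuracy(y_true, y_pred, indices):
    """Share of correct predictions among the given test-set indices."""
    correct = sum(1 for i in indices if y_true[i] == y_pred[i])
    return correct / len(indices)


# Hidden test labels and one team's submission (toy data).
y_true = ["comedy", "drama", "thriller", "comedy", "drama", "thriller", "comedy", "drama"]
submission = ["comedy", "drama", "thriller", "drama", "drama", "comedy", "comedy", "thriller"]

public_idx, private_idx = split_leaderboard(len(y_true))

# Shown to participants during the competition.
public_score = accuracy(y_true, submission, public_idx)
# Revealed only at the end and used for the final ranking.
private_score = accuracy(y_true, submission, private_idx)
print(public_score, private_score)
```

Because the private part is never shown during the competition, a team that tunes its model too closely to the public score can drop in the final ranking.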

Someone started up a remote home server, someone rushed to deploy an environment in the cloud, and some even brought a desktop tower with them - everyone did what they could! Over the course of the day, the teams generated a wide variety of solution ideas: from using Elasticsearch to find similar texts to stacked ensembles of models whose results cannot be reproduced ~~with a sober head~~ the next day.

To wrap up and compare the models, in addition to scoring on the private leaderboard, we arranged an interactive demo to see how the models work when wrapped in services. The organizers approached this with humor and included a fragment from the movie "The Fifth Element", where the text reads like something terrifying, but it is actually a funny scene with Chris Tucker. Most of the models got it wrong and predicted a thriller or a drama, but not a comedy.

In the end, an ensemble of linear models and boostings with hand-crafted features based on clustering and other shamanic transformations won; neural networks appeared in the second- and third-place solutions. In addition to cool prizes (the main prize is a trip to NIPS or another cool conference), you return from the hackathon with new battle-tested friends who will share knowledge and skills with you. By the end, I did not even want to leave this picturesque place and cozy company.
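Going back to the winning ensemble: the article does not say how the models were combined, so the snippet below is only a sketch of one common option, a weighted average of per-genre probabilities (soft voting). The function name `blend_probabilities`, the genre probabilities, and the weights are all made up for illustration.

```python
def blend_probabilities(model_outputs, weights):
    """Weighted average of per-genre probabilities from several models."""
    total_weight = sum(weights)
    genres = model_outputs[0].keys()
    return {
        genre: sum(w * out[genre] for w, out in zip(weights, model_outputs)) / total_weight
        for genre in genres
    }


# Hypothetical predictions of two models for a single dialogue.
linear_model = {"comedy": 0.2, "thriller": 0.5, "drama": 0.3}
boosting = {"comedy": 0.6, "thriller": 0.3, "drama": 0.1}

# The boosting model is trusted twice as much (an arbitrary illustrative choice).
blended = blend_probabilities([linear_model, boosting], weights=[1.0, 2.0])
predicted_genre = max(blended, key=blended.get)
print(predicted_genre)  # comedy: (0.2 + 2 * 0.6) / 3 is the largest blended score
```

Here the linear model alone would have predicted a thriller, while the blend picks a comedy. In practice the weights would be tuned on validation data rather than set by hand.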

Instead of a conclusion

In this article, I shared the challenges of building a Data Science culture in a company and how Raiffeisen Data University helps Data Scientists along the way.

Of course, not all problems have been resolved, but we now have a more cohesive and mature data community than we did a couple of years ago, and we are ready to take on the new challenges ahead.

It would be very interesting to know whether you have faced similar problems in your work, and who solved them and how.

Maybe someone will share life hacks from their experience? ;)
