The 2024 EMA study reveals an alarming trend: the average cost per minute of IT outages for organizations with fewer than 10,000 employees has increased by 60% since 2022. The likely causes are companies' increasing reliance on IT, staff shortages, and delayed technology modernization.
Alexandr Hacicheant, Head of Reliability Engineering at Mayflower, is a specialist responsible for the reliability, high availability, and fault tolerance of the company's products. He helps the company to prevent failures, as well as to eliminate their consequences quickly. Over the years, Alexandr has implemented approaches that have solved scaling problems, increased the stability of IT systems, and improved the quality of life for developers. In the interview, he talked about technologies for IT optimization, trends in the industry, and ways to form an effective team.
— Eight years ago, you joined Mayflower, a company that created a live-streaming platform. How has your work changed over the years?
— I started as a backend developer in a small team of about 15 people. Together with my colleagues, we launched features for a new platform to test their relevance in the market. For the first few years, we actively developed them and delivered them to the product, but as we grew, we started to face problems with load scaling. So, my focus shifted to finding bottlenecks and optimizing them.
I became more involved in technical work that let us grow by optimizing our solutions and improving the technology stack, rather than by rapidly scaling infrastructure and buying new servers. In particular, I convinced management to allocate 30% of development time to technical debt (that is, to problems accumulated in the program code or architecture).
Before some of my initiatives were introduced, it was not uncommon for developers to be woken up by late-night calls to restore services because of scalability problems. This led to employee fatigue and burnout. We needed to find a formula that would maintain a work-life balance.
Under my leadership, we were able to integrate monitoring tools for more detailed analysis of application behavior under load, and we started to implement SRE practices (practices of using software tools for monitoring, change and incident management, and documentation around IT infrastructure services). As a result, engineers on the teams responsible for the services themselves were able to keep them running sustainably with monitoring tools and flexibly configured alerts.
— What are you responsible for at the company as Head of Reliability Engineering?
— Now, my goal is to ensure the stability and fault tolerance of the platform and its services, so that they run without failures and are restored promptly if failures do occur. Accordingly, I need to identify bottlenecks in the IT infrastructure and prevent them from recurring. Among other things, I provide the teams with the right knowledge and guidance. For example, I hold meetups on how services are organized so that employees can easily troubleshoot after releases. Under high load, when thousands of users are already working with new functionality, it is important to be able to analyze the problem quickly, decide whether to roll back the changes or apply a hotfix, restore the system, and find the cause of the failure.
The work of my team has greatly reduced the number of service outages. Previously, major failures could happen every week; now, they happen at most once a month. In addition, the developers, especially the team leaders, work with less stress thanks to well-functioning processes and scheduled on-call duties.
In the past, team leaders and technical leaders were independently responsible for their own areas and domains. However, as the services' functionality grew more complex, this scheme was no longer optimal. To increase efficiency and prevent burnout among the responsible employees, my team and I set up three lines of technical support.
The first line is the monitoring team, which watches the services around the clock and prevents possible failures. The second line is the SRE team, consisting of the developers and DevOps engineers of a particular service. The third line consists of tech leads and team leads, who have the broadest knowledge of the service. If the team on the first or second line lacks the competence or authority to solve a problem, it can always be escalated to the next level.
Initially, after the implementation of the new scheme of work, problems were often referred to the third line. However, as the monitoring and SRE teams gained experience in these issues, prepared post-mortems based on the results of incident resolution, and developed action items to prevent similar situations in the future, there were fewer escalations and problems.
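The three-line escalation scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Mayflower's actual tooling: the `Incident` and `SupportLine` classes, the `can_resolve` predicate, and the severity thresholds are all hypothetical names and values introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Incident:
    service: str
    severity: int  # hypothetical scale: 1 (minor) to 3 (major)


@dataclass
class SupportLine:
    name: str
    # Hypothetical predicate: does this line have the competence
    # and authority to resolve the incident?
    can_resolve: Callable[[Incident], bool]


def escalate(incident: Incident, lines: list[SupportLine]) -> Optional[str]:
    """Return the name of the first line able to resolve the incident."""
    for line in lines:
        if line.can_resolve(incident):
            return line.name
    return None  # no line could handle the incident


# Assumed thresholds, for illustration only.
SUPPORT_LINES = [
    SupportLine("monitoring team", lambda i: i.severity <= 1),
    SupportLine("SRE team", lambda i: i.severity <= 2),
    SupportLine("tech leads and team leads", lambda i: True),
]

print(escalate(Incident("chat", severity=2), SUPPORT_LINES))  # SRE team
```

Ordering the lines in a list keeps the rule simple: an incident always stops at the first line that can handle it, which mirrors the drop in third-line escalations as the first two lines gained experience.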
— What kind of projects are you currently running?
— Right now, I have two main areas of activity. The first is dividing a large service (a monolith) into several smaller ones. I am also responsible for finding specific ways of dividing it into microservices so that the process does not stretch on for years.
A monolithic infrastructure is a single, complete system where all components of an application are connected to each other. This simplifies development but makes it very difficult to scale and update individual parts. Microservices, on the other hand, divide the application into independent modules. Each is responsible for a specific function and can be deployed or modified independently of other services.
Accordingly, the implementation of microservice infrastructure is important for faster development of our products and their efficient upgrades. My role here is to guide teams in choosing technologies and architectural solutions. I also organize internal meetups for knowledge sharing and work planning.
The second area of my work is improving the monitoring and observability of services through interaction with the development and operations teams. It is important that data does not have to be collected manually. The system should understand where and what is broken and immediately send a signal to the right person. The ultimate goal is for the automated alert system to help identify problems quickly and direct them to the relevant teams.
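The idea of sending a signal straight to the right team can be illustrated with a small routing sketch. Everything here is an assumption made for the example: the `Alert` class, the `OWNERS` map, the service and team names, and the fallback team are hypothetical and do not describe the company's real alerting setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    message: str


# Hypothetical ownership map: service name -> responsible team.
OWNERS = {
    "media-server": "streaming-team",
    "billing": "payments-team",
}
DEFAULT_TEAM = "monitoring-team"  # assumed fallback for unowned services


def route(alert: Alert) -> str:
    """Pick the team that should receive the alert."""
    return OWNERS.get(alert.service, DEFAULT_TEAM)


def notify(alert: Alert) -> str:
    """Build the notification text; a real system would page someone here."""
    return f"[{route(alert)}] {alert.service}: {alert.message}"


print(notify(Alert("media-server", "packet loss above threshold")))
```

Keeping ownership in one explicit map means nobody has to work out by hand who should be told about a failure; alerts for unknown services still reach a default team instead of being dropped.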
— Can you give an example of an innovative solution you have implemented in the company?
— One of our company's projects is a media server designed to minimize latency in video streaming. A delay of several seconds occurs when HLS (HTTP Live Streaming) is used, which is a popular technology for Video On Demand, used by Netflix in particular. This latency is acceptable when watching videos, but for interactive platforms like Twitch, it ruins the user experience. For real-time video viewing, another technology—WebRTC—is suitable.
WebRTC typically runs over UDP, where data packets are sent without acknowledgment of receipt. Packets can therefore be lost, which causes glitches like the drops in picture quality seen in Zoom. On a stable connection, however, WebRTC provides a near-instant response. Platforms can switch between HLS and WebRTC depending on connection quality to give the client the best experience: if bandwidth is low, HLS is used; if it is high, WebRTC is used.
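The switching rule can be written down as a one-line decision. The sketch below is a simplified illustration: the `choose_protocol` function and the `MIN_WEBRTC_KBPS` cutoff are hypothetical, and a real platform would likely also weigh packet loss and latency, not bandwidth alone.

```python
# Hypothetical cutoff; a real platform would tune this from measurements.
MIN_WEBRTC_KBPS = 1500


def choose_protocol(bandwidth_kbps: float) -> str:
    """Return "webrtc" for fast connections and "hls" otherwise."""
    return "webrtc" if bandwidth_kbps >= MIN_WEBRTC_KBPS else "hls"


for kbps in (400, 1500, 8000):
    print(kbps, choose_protocol(kbps))
```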
At a certain stage, we felt that third-party solutions did not meet all of our company's needs. Even with paid support from vendors, the speed of bug fixes was low, and vendors could not cover all of our business requests due to the complexity of their implementation. As a result, we started developing our own media server using different technologies.
It was a long game, and our expectations were met. The product has become a key business advantage and the best solution in terms of performance, speed of delivery of new features, and fixing issues. It made the end solution more convenient for consumers and improved the media quality on the platform.
As part of this project, I had to define a technology stack, assemble a team of developers, DevOps, and QA engineers, build a technical roadmap, and successfully integrate the internally developed media server in place of a third-party solution.
— As an expert with extensive experience in management, tell us what approaches you use to hire employees effectively.
— I will tell you about the practices I use at job interviews. Usually, the conversation takes about an hour: I get to know the candidate, talk about the product and tasks, and then we move on to the technical part. Sometimes, the conversation doesn't go well, and in such cases, I give immediate feedback and explain why we won't work together yet. It's respectful of the candidate's time and mine. I often get positive feedback: people thank me for my openness and honesty and understand where they need to improve.
I also note if a candidate shows a lack of interest. It happens: people may be interviewing just to sharpen their skills or to gain experience. In general, I don't hesitate to look for common ground. It's important that we both understand whether we are a good fit for each other.
Another important point is soft skills. I am interested in what a person does in their spare time, whether they educate themselves, and how they maintain a balance between work and personal life. This helps to avoid burnout. And, of course, it is important to understand why a candidate is coming to us: for a salary or for something more. I always seek to clarify the motivation. In addition, if there are any questions, I will ask for references from colleagues at previous jobs.
When it comes to hard skills, we as a team can overlook some gaps if the candidate shows enough enthusiasm and an ability to learn. Such a specialist is worth considering with an eye to the future.
— You mentioned the need for self-education for IT professionals. Why is it so important in the industry?
— The last decade has seen revolutionary changes in IT. Technology is evolving so rapidly that what was relevant yesterday may already be obsolete. For example, AI offers far more efficient approaches to tasks that were previously considered solved.
To stay relevant, it's important for businesses to keep up with trends and update their technology stack. It is not always about global upgrades of IT systems—many companies start with small experiments with new technologies and test them on less critical tasks. If the result is successful, they implement them on a broader level.
Accordingly, knowledge of current technologies allows specialists to move up the career ladder within the company and, at the same time, opens up broader prospects on the external market.
— What would you recommend to IT professionals who want to improve their skills on their own?
— Many companies today are open-sourcing their projects. In my opinion, contributing to them is a great opportunity to share experience and learn new approaches. It helps developers learn to solve complex business problems, for example, to process large amounts of data more efficiently and to cope with high load on services using different scaling methods.