Ilya Segalovich Prize. A story about computer science and publications on the occasion of the launch

Today we are launching the Ilya Segalovich Science Prize. It will be awarded for achievements in computer science. Undergraduate and graduate students can apply for the prize themselves or nominate their research supervisors. The winners will be selected by representatives of the academic community and Yandex. The main selection criteria are publications and conference presentations, as well as contributions to the development of the community.

The first award ceremony will take place in April. As part of the prize, young scientists will receive 350 thousand rubles each; in addition, they will be able to attend an international conference, work with a mentor, and complete an internship in the Yandex research department. Research supervisors will receive 700 thousand rubles each.

On the occasion of the launch, we decided to talk here on Habr about the criteria for success in the world of computer science. Some Habr readers are already familiar with these criteria, while others may have a false impression of them. Today we will bridge this gap by touching on all the main topics, including articles, conferences, datasets, and the transfer of scientific ideas to services.

For computer scientists, the main criterion of success is publishing their work at one of the top international conferences. This is the first "checkpoint" in the recognition of a researcher's work. For example, in machine learning as a whole, the International Conference on Machine Learning (ICML) and the Conference on Neural Information Processing Systems (NeurIPS, formerly NIPS) stand out. There are also many conferences devoted to specific areas of ML, such as computer vision, information retrieval, speech technology, and machine translation.

Why publish your ideas

People far from computer science may have the misconception that it is better to keep the most valuable ideas secret and try to profit from their uniqueness. In our field, the real situation is exactly the opposite. A scientist's authority is judged by the significance of their work and by how often other scientists cite their articles (the citation index). This is an important characteristic of a career. Researchers move up the professional ladder and earn more respect among their peers only if they consistently produce strong work that gets published, becomes known, and forms the basis for the work of other scientists.

Many of the top articles (perhaps most of them) are the result of collaboration between researchers at different universities and companies around the world. An important and very valuable point in a researcher's career is the moment when they become able to generate and filter ideas on their own, relying on their experience. Even after that, colleagues continue to provide invaluable help. Scientists help each other work out ideas and co-author articles, and the greater a scientist's contribution to science, the easier it is for them to find like-minded people.

Finally, the density and availability of information is now so great that very similar (and genuinely valuable) scientific ideas occur to different researchers at the same time. If you do not publish an idea, someone else will almost certainly publish it before you. The “winner” is often not the one who came up with the innovation a little earlier, but the one who published it a little earlier, or the one who managed to present the idea as fully, clearly, and convincingly as possible.

Articles and datasets

A scientific article is built around the main idea that the researcher proposes. This idea is the researcher's contribution to computer science. The article begins with a description of the idea, formulated in a few sentences. This is followed by an introduction that describes the range of problems the proposed innovation solves. The description and introduction are usually written in simple language that a wide audience can understand. After the introduction, the stated problems have to be formalized in mathematical language, with strict notation. Then, using this notation, the author gives a clear and exhaustive presentation of the essence of the proposed innovation and points out how it differs from previous, similar methods. All theoretical claims must either be supported by references to previously published proofs or be proved in the article itself. The proofs may rely on assumptions. For example, one can give a proof for the case when there is an infinite amount of training data (an obviously unattainable situation) or when the data points are completely independent of each other. Toward the end of the article, the scientist reports the experimental results obtained.

To be more likely to be approved by the reviewers that the conference organizers appoint, an article must have one or more of the following attributes. The key factor that increases the chances of approval is the scientific novelty of the proposed idea. Novelty is usually evaluated relative to existing ideas, and the work of evaluating it is done not by the reviewer but by the author of the article. Ideally, the author should describe the existing methods in detail and, if possible, present them as special cases of the new method. In this way the scientist shows that the accepted approaches do not always work, and that the article generalizes them and proposes a broader, more flexible, and therefore more effective theoretical formulation. If the novelty is undeniable, reviewers evaluate the rest of the article less meticulously; for example, they may turn a blind eye to bad English.

To reinforce the novelty, it is useful to add to the article a comparison with existing methods on one or more datasets. Each of them must be open and accepted in the academic community. Examples include the ImageNet image repository and datasets named after the institutions behind them, such as MNIST (Modified National Institute of Standards and Technology) and CIFAR (Canadian Institute for Advanced Research). The difficulty is that such an “academic” dataset often differs in content and structure from the real data that the industry deals with. Different data means different results for the proposed method. Scientists who also work for the industry try to take this into account and sometimes add caveats like “on our data the result is such and such, but on the public dataset it is such and such”.
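The gap between a public benchmark and in-house data can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the two "methods" are simple thresholds, and both datasets are tiny made-up lists rather than real benchmarks.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[float, int]  # (single feature value, binary label)


def accuracy(predict: Callable[[float], int], data: Sequence[Example]) -> float:
    """Share of examples for which the prediction matches the label."""
    correct = sum(1 for x, y in data if predict(x) == y)
    return correct / len(data)


# Two hypothetical methods: each is just a threshold on the feature.
def baseline(x: float) -> int:
    return int(x > 0.5)


def proposed(x: float) -> int:
    return int(x > 0.3)


# Toy stand-ins for a public benchmark and for private production data.
public_data = [(0.1, 0), (0.4, 1), (0.6, 1), (0.9, 1)]
in_house_data = [(0.2, 0), (0.35, 0), (0.45, 0), (0.8, 1)]

# Evaluate every method on every dataset and report the numbers side by side.
report = {
    name: {
        "public": accuracy(method, public_data),
        "in_house": accuracy(method, in_house_data),
    }
    for name, method in [("baseline", baseline), ("proposed", proposed)]
}

for name, scores in report.items():
    print(f"{name}: public={scores['public']:.2f}, in-house={scores['in_house']:.2f}")
```

On this toy data the hypothetical "proposed" method wins on the public set and loses on the in-house one, which is exactly the kind of discrepancy that such caveats are meant to disclose.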

Sometimes a proposed method is completely tailored to an open dataset and does not work on real data. This common problem can be tackled by releasing new, more representative datasets, but these often involve private content that companies simply do not have the right to open. In some cases companies anonymize the data (which can be complex and painstaking work): they remove any fragments that point to a specific person. For example, faces and numbers in photographs are erased or made illegible. In addition, for a dataset to become not just publicly available but a standard on which scientists can conveniently compare ideas, it is not enough to publish it: one also has to write a separate, citable article about the dataset and its advantages.
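The article talks about anonymizing photographs; the same principle for text records can be sketched as follows. This is a deliberately simplified illustration, not a production-grade anonymizer: the two regular expressions, the placeholder tokens, and the function name `anonymize` are all assumptions, and real pipelines need far more careful rules and manual checks.

```python
import re

# Hypothetical, intentionally simple patterns for two kinds of personal data.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s()-]{7,}\d")


def anonymize(text: str) -> str:
    """Replace e-mail addresses and phone-like digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


print(anonymize("Call +7 (495) 123-45-67 or mail ivan@example.com"))
```

Placeholders are used instead of plain deletion so that the structure of a record stays visible to anyone who later works with the dataset.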

It is worse when there are no open datasets for the topic being researched. Then the reviewer has to take the author's results on faith. In theory, the author could even inflate them and go uncaught, but in an academic environment this is unlikely, since it runs counter to the desire of the vast majority of scientists to advance science.

In a number of areas of ML, including computer vision, it is also customary to attach links to code (usually on GitHub) to articles. The articles themselves contain either very little code or only pseudocode. Here, again, difficulties arise if the article is written by a researcher from a company rather than a university. By default, code written at a corporation or a startup is covered by an NDA. Researchers and their colleagues have to put in a lot of effort to separate the code related to the idea being described from internal repositories that must stay closed.

The chance of publication also depends on the relevance of the chosen topic. Relevance is largely dictated by products and services: if a corporation or a startup is interested in building a new service or improving an existing one based on the idea from the article, that is a plus.

As already mentioned, computer science articles are rarely written alone. As a rule, though, one of the authors spends much more time and effort than the rest and makes the greatest contribution to the scientific novelty. This person is listed first among the authors, and later references to the article may mention only the first author (for example, "Ivanov et al.", which is Latin for "Ivanov and others"). However, the contribution of the others is also extremely valuable; otherwise they would not be on the list of authors.

Review process

Submissions usually close a few months before the conference. After an article is submitted, reviewers have 3-5 weeks to read it, rate it, and comment on it. This happens under either a single-blind system, in which the authors do not see the names of the reviewers, or a double-blind system, in which the reviewers also do not see the names of the authors. The second option is considered less biased: several scientific papers have shown that the author's popularity affects the reviewer's decision. For example, a reviewer may assume that a scientist with a large number of published articles a priori deserves a higher rating.

That said, even with double-blind review, the reviewer will probably guess who the author is if they work in the same field. In addition, by the time of the review the article may already be published on arXiv, the largest repository of scientific preprints. Conference organizers do not forbid this, but they recommend using a different title and a different abstract for the arXiv version. Still, if the article was posted there, finding it is not difficult.

An article is always evaluated by several reviewers. One of them is assigned the role of meta-reviewer, who normally only looks at the verdicts of the other reviewers and makes the final decision. If the reviewers disagree in their assessment of the article, the meta-reviewer may also read it to get the full picture.
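The meta-review step can be sketched as a small function. This is a toy model under stated assumptions, not the procedure of any real conference: the numeric score scale, the averaging rule, the `accept_threshold` and `max_spread` parameters, and the function name `meta_review` are all hypothetical.

```python
from statistics import mean
from typing import Sequence, Tuple


def meta_review(
    scores: Sequence[float],
    accept_threshold: float = 6.0,  # hypothetical cut-off on a 1-10 scale
    max_spread: float = 3.0,        # hypothetical limit on reviewer disagreement
) -> Tuple[str, bool]:
    """Return (preliminary decision, whether the meta-reviewer should read the paper)."""
    if not scores:
        raise ValueError("at least one reviewer score is required")
    # A large gap between the highest and lowest score signals disagreement.
    needs_own_reading = max(scores) - min(scores) > max_spread
    decision = "accept" if mean(scores) >= accept_threshold else "reject"
    return decision, needs_own_reading


print(meta_review([7, 8, 7]))  # reviewers agree
print(meta_review([3, 8, 6]))  # reviewers disagree
```

In the second call the scores diverge widely, so the sketch flags the paper for the meta-reviewer's own reading instead of relying on the average alone.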

Sometimes, after seeing the rating and comments, the author gets the opportunity to enter into a discussion with the reviewer, and there is even a chance of changing the reviewer's mind (however, not all conferences have such a system, and it is even rarer for the discussion to seriously influence the verdict). In the discussion you cannot refer to other scientific works, except those already referenced in the article. You can only "help" the reviewer to better understand the content of the article.

Conferences and journals

Computer science articles are more often sent to conferences than to scientific journals. The reason is that journals have requirements that are harder to meet, and their peer review process can take months or even years. Computer science is a very fast-moving field, so authors are usually not prepared to wait that long for publication. However, an article already accepted at a conference can later be extended (for example, with more detailed results) and published in a journal, where the length restrictions are not so strict.

Events at the conference

The format in which the authors of approved articles take part in the conference is determined by the reviewers. If the article is given the green light, the author is most often given a stand for a poster. A poster is a static slide with a summary of the article and illustrations. Some of the conference halls are filled with long rows of poster stands. The author spends a significant part of the time next to the poster, talking to scientists who are interested in the article.

A slightly more prestigious form of participation is a lightning talk. If the reviewers consider the article worthy of one, the author is given about three minutes to speak to a wide audience. On the one hand, a lightning talk is a good opportunity to present your idea to more than just the people who became interested in the poster on their own initiative. On the other hand, visitors who come to a poster on their own are better prepared and more immersed in your particular topic than the average listener in the hall. So in a lightning talk you also have to find time to bring people up to speed.

Usually, at the end of their lightning talk, the authors give the poster number so that listeners can find it and better understand the article.

The last and most prestigious option is a poster plus a full-fledged presentation of the idea, where there is no need to rush through the story.

Of course, scientists, including the authors of approved papers, come to a conference not only to show themselves. First, for obvious reasons, they are eager to find posters related to their field. Second, it is important for them to expand their list of contacts with a view to joint academic work in the future. This is not headhunting, or at most only its very first stage, which is followed at the very least by a mutually beneficial exchange of ideas and results and by joint work on one or more articles.

At the same time, productive networking at a top conference is difficult because there is almost no free time. If, after a whole day spent at presentations and in discussions at posters, a scientist still has some energy left and has already overcome jet lag, they go to one of the many parties. These are hosted by corporations, so the parties are often more about headhunting. Still, many guests use them not to find a new job but, again, for networking. In the evening there are no more talks or posters, so it is easier to "catch" the specialist you are interested in.

From idea to production

Computer science is one of the few fields where the interests of corporations and startups are closely tied to the academic environment. NeurIPS, ICML, and other similar conferences attract many people from the industry, not just from universities. This is typical for computer science, but for most other sciences the opposite is true.

On the other hand, not all ideas presented in articles immediately go into creating or improving services. Even within the same company, a researcher can offer colleagues from a service an idea that is a breakthrough by scientific standards and still be refused implementation, for a number of reasons. One of them has already been mentioned: the difference between the "academic" dataset on which the article is based and the real data. In addition, implementing an idea may take a long time, require a large amount of resources, or improve only one metric at the cost of worsening others.

The situation is saved by the fact that many developers are something of researchers themselves. They attend conferences, speak the same language as academics, propose ideas, sometimes take part in creating articles (for example, by writing code), or even act as authors. If a developer is immersed in the academic process and follows what is happening in the research department, in other words, if the developer moves to meet the scientists halfway, then the cycle of turning scientific ideas into new service capabilities becomes shorter.

We wish all young researchers good luck and great achievements in their work. If this post did not tell you anything new, then perhaps you have already published at a top conference. Apply for the prize yourself and nominate your research supervisors.

Source: habr.com
