Litigation against Microsoft and OpenAI related to the GitHub Copilot code generator

Matthew Butterick, an open-source typography developer, together with the Joseph Saveri Law Firm, has filed a lawsuit (PDF) against the technology vendors behind the GitHub Copilot service. The defendants include Microsoft, GitHub, and the companies behind the OpenAI project, which produced the OpenAI Codex code-generation model underpinning GitHub Copilot. The suit asks the court to rule on the legality of creating services like GitHub Copilot and to determine whether such services violate the rights of other developers.

The defendants' activity is compared to creating a new kind of software piracy, one based on manipulating existing code with machine-learning methods and profiting from other people's work. The creation of Copilot is also seen as introducing a new mechanism for monetizing the work of open-source developers, despite GitHub's earlier promise never to do so.

The plaintiffs' position is that code generated by a machine-learning system trained on publicly available source code cannot be treated as a fundamentally new and independent work, since it is the product of algorithms processing existing code. According to the plaintiffs, Copilot merely reproduces code that maps directly to existing code in open repositories, and such manipulation does not meet the criteria for fair use. In other words, the plaintiffs regard code synthesis in GitHub Copilot as the creation of a derivative work from existing code that is distributed under specific licenses and has specific authors.

In particular, Copilot is trained on code distributed under open licenses, most of which require a notice of authorship (attribution). The generated output does not carry this notice, which is a clear violation of most open licenses, including the GPL, MIT, and Apache licenses. In addition, Copilot violates GitHub's own terms of service and privacy policy, fails to comply with the DMCA provision that prohibits removing copyright-management information, and runs afoul of the CCPA (California Consumer Privacy Act), which regulates the handling of personal data.

The text of the lawsuit provides a rough calculation of the damage caused to the community by Copilot's activities. Under Section 1202 of the Digital Millennium Copyright Act (DMCA), the minimum statutory damages are $2,500 per violation. Given that the Copilot service has 1.2 million users and each use of the service involves three DMCA violations (attribution, copyright notice, and license terms), the minimum total damage comes to $9 billion (1,200,000 × 3 × $2,500).
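The damages arithmetic cited in the complaint can be checked directly. The figures below are the ones stated in the article (1.2 million users, three violations per use, $2,500 statutory minimum); the variable names are illustrative.

```python
# Reproducing the damages estimate cited in the lawsuit.
USERS = 1_200_000                   # Copilot users, per the complaint
VIOLATIONS_PER_USE = 3              # attribution, copyright notice, license terms
MIN_DAMAGES_PER_VIOLATION = 2_500   # USD, DMCA Section 1202 statutory minimum

total = USERS * VIOLATIONS_PER_USE * MIN_DAMAGES_PER_VIOLATION
print(f"Estimated minimum damages: ${total:,}")  # prints $9,000,000,000
```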

The Software Freedom Conservancy (SFC), which has previously criticized GitHub and Copilot, commented on the lawsuit by recommending that community advocacy not deviate from one of its previously formulated principles: "community-oriented enforcement should not prioritize financial gain." According to the SFC, Copilot's actions are unacceptable primarily because they undermine the copyleft mechanism, which aims to guarantee equal rights to users, developers, and consumers. Many of the projects used to train Copilot are distributed under copyleft licenses, such as the GPL, which require that derivative code be distributed under a compatible license. Pasting existing code suggested by Copilot may therefore unwittingly violate the license of the project from which the code was borrowed.

Recall that this summer GitHub launched GitHub Copilot as a commercial service. Trained on the body of source code hosted in public GitHub repositories, it can generate typical constructs as you write code. The service can produce fairly complex and large blocks of code, up to ready-made functions, and these may repeat text fragments from existing projects. According to GitHub, the system tries to recreate the structure of code rather than copy the code itself; nevertheless, in about 1% of cases a suggestion may include snippets of existing projects longer than 150 characters. To prevent the substitution of existing code, Copilot includes a special filter that checks suggestions for overlaps with projects hosted on GitHub, but this filter is enabled at the user's discretion.
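A filter of the kind described above can be sketched as a verbatim-overlap check: flag a suggestion if it shares a contiguous fragment longer than 150 characters with known code. This is a hypothetical illustration only; GitHub's actual filter implementation is not public, and the function names and brute-force approach here are assumptions.

```python
def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous fragment shared by a and b.

    Classic dynamic-programming solution, O(len(a) * len(b)) time.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        cur = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def suggestion_matches_existing(suggestion: str, corpus: list[str],
                                threshold: int = 150) -> bool:
    """True if the suggestion overlaps any corpus file beyond the threshold.

    The 150-character threshold matches the figure GitHub cites for
    snippets that repeat existing code.
    """
    return any(longest_common_substring(suggestion, src) > threshold
               for src in corpus)
```

A production filter would use indexed fingerprinting rather than pairwise comparison, but the acceptance criterion — a verbatim overlap above a length threshold — is the same idea.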

Two days before the lawsuit was filed, GitHub announced its intention to ship a feature in 2023 that tracks the relationship between snippets generated by Copilot and existing code in repositories. Developers will be able to view a list of similar code already present in public repositories and filter the matches by license and modification date.

Source: opennet.ru
