Why are there no antitrust claims vs GitHub Copilot, when there is a precedent?


Microsoft GitHub released Copilot last year, a tool for code generation and auto-completion that has been trained on a vast amount of open source code, notably all the open source code hosted on GitHub itself.

This sparked a number of discussions and lawsuits surrounding the tool. The concerns range from the training of the model, to the usage of the tool, to the final copyright of the generated code snippets. One of the ongoing lawsuits is a class action brought by the Joseph Saveri Law Firm against GitHub, Microsoft and OpenAI, raising a variety of copyright claims (some have already been dismissed since it was announced).

The product has been out for a whole year now. There are many claims that have been discussed online.

Unfortunately, there is one that I have not seen addressed online and that may be important. I am starting to worry the lawyers accidentally missed the precedent.

Today we’re going to talk about the potential for an antitrust case against Microsoft GitHub and GitHub Copilot. There is a precedent against another tech giant that is surprisingly similar.

Precedent: Antitrust action vs Google Books

See Wikipedia for more details: https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,_Inc.

Google released a product called Google Books circa 2004. Google started scanning and digitizing all books, and making the books accessible in digital form online. When you searched Google for a paragraph or a quote from a book, Google could return a link to the matching page of the book on Google Books.

Google would only show you the page(s) with the sentence you were looking for, plus possibly a few pages before or after. It didn’t let you read the whole book or any whole chapter, although Google had digitized the full book. This detail was important for the case but not for what we are going to discuss today.

This was the subject of a major copyright lawsuit because Google had digitized all the books they could get their hands on and made them accessible online (in part). The lawsuit was led by the Authors Guild, allegedly the oldest and largest professional organization for writers.

Around 2009, Google and the Authors Guild were reaching an agreement to license all the works under the Authors Guild to Google, for use in Google Books. This would have ended the lawsuit; however, the agreement was blocked on antitrust grounds.

Both parties attempted to revise the agreement multiple times, circa 2009-2011, and all attempts at an agreement were blocked again on antitrust grounds.

See the Department of Justice: Statement of Interest of the United States of America Regarding Proposed Class Settlement

“The current settlement proposal would stifle innovation and competition in favor of a monopoly over the access, distribution and pricing of the largest collection of digital books in the world, and would reinforce an already dominant position in search and search advertising.”

If you’re curious, the case continued in different directions and reached important decisions regarding books, I won’t get into them because they are not relevant to GitHub. (Google being allowed or not allowed to show a one page preview of a book, to a user who was looking for a quote from a book, is not directly applicable to the concerns surrounding GitHub and GitHub Copilot)

Similarities with Microsoft GitHub Copilot

There are a lot of similarities between the two cases of Google Books and GitHub Copilot:

Microsoft GitHub is the largest collection of open source code in the world. Microsoft GitHub is in a unique and dominant position to host, access and distribute most of the open source code in the world.

GitHub boasted 100+ million repos in 2018 and 100+ million registered users in 2023. It’s difficult to come across market share numbers; the estimates I have seen put around 90% of all open source software on GitHub.

These are exactly the circumstances Google Books and the Authors Guild were in: having access to, and distribution of, the largest and unmatched collection of digitized books in the world.

The creation and training of Microsoft GitHub Copilot required accessing and scanning a vast amount of open source code from different authors. The making of GitHub Copilot was only possible thanks to the unique position of GitHub in hosting and controlling access to most of the open source code in the world.

There’s an important notion to address here. Open source code on GitHub might be thought of as “open and freely accessible” but it is not. It’s possible for any person to access and download one single repo from GitHub. It’s not possible for a person to download all repos from GitHub, or even a percentage of all repos: they will hit limitations and restrictions when trying to download too many repos. (Unless there are some special archives or mechanisms I am not aware of.)

The precedent has further implications for any claim related to copyright infringement against GitHub Copilot, and for whether GitHub may have rights to use the code it’s hosting to create Copilot.

In defense against some copyright claims, Microsoft GitHub may argue that it is permitted to use the code due to some contractual clause (e.g. the Terms of Service allow them to use code to improve their service, and to create derivative services like Copilot). This defense is likely to fail, due to the precedent.

In the precedent, the licensing of books from the Authors Guild to Google Books was prohibited on antitrust grounds, because it would have granted rights to the largest collection of books in the world and put Google in a position of monopoly. Any defense from GitHub claiming it can self-license all open source code to use in training is likely to fail similarly under antitrust.

If anything, it may be more likely to fail because the Authors Guild and Google were two independent entities trying to negotiate rights, which was denied, whereas GitHub is a single entity acting as both sides. It’s in a much stronger position of monopoly if it can contractually grant itself rights on all the open source code it is hosting (almost all open source code in the world).

Last but not least, the precedent did not allow the Authors Guild to settle and license all their books to Google (assuming there were copyright issues and licensing was required).

If the precedent is applied similarly to GitHub, it might prevent plaintiffs (like those represented by the Saveri Law Firm) from reaching any settlement with GitHub.

The goal of a copyright lawsuit is to generate punitive damages or license fees (the US is all about money), so it can be quite problematic for either or both parties if an agreement cannot be reached due to antitrust. Well, that depends on who is represented and what they seek.

As a corollary, if GitHub is aware of that precedent, it should have zero interest in discussing any settlement, since any settlement will be denied on antitrust grounds (and it’s against its interests anyway, it doesn’t want to pay to use any source code). The only move for GitHub is to go all the way and get a decision that training is fair use (there’s another precedent on that)… but can it be fair use when it relies on its unique position to access and control most open source code and it profits financially?

Conclusion

I hope the potential for antitrust in the case of Microsoft GitHub Copilot is looked at more closely. There is potential!

The outcome is important and it will have far reaching consequences for platforms that distribute content.

If GitHub can produce code by training an AI on all the code it is hosting, YouTube could produce videos and music by training an AI on all the content it is hosting, the Authors Guild could produce books by training an AI on all the books it holds the rights for, and Shutterstock could produce more stock images by training an AI on all the stock images it is hosting.

2 thoughts on “Why are there no antitrust claims vs GitHub Copilot, when there is a precedent?”

  1. Meh, no thanks. OSS is OSS, people need to start understanding that OSS is free as in Freedom, at least under most permissive licenses. Enacting BS lawsuits just because one company has the resources and interest in doing something stifles innovation, the core difference here being that OSS is totally free and open for any other ML company to use in their training set.

    I get Google Books, it was copyrighted/protected info; but let’s stop deviating from what OSS should be, open and free to use.


  2. It’s not possible for a person to download all repos from Github or a percentage of all repos, they will hit limitations and restrictions when trying to download too many repos.

    Is that really true? GitHub throttles their API, of course, but do they actually throttle the anonymous downloads via https://github.com/user/repo.git links?

    That said, while I find your assertion that GitHub is “controlling access to most of the open source code in the world” dubious at best, your lesser point that GitHub may have much better large-scale access to that code to be good.

    Perhaps one solution (or part of a solution) would be to legislate that companies making “AI” systems available to the public must make their training data sets available to the public as well, so that they can both be examined and analysed (to see what’s in them, look for copyright violations, etc.) and used by other companies and individuals.

    This would also force these companies to make a serious attempt at avoiding copyright and other IP violations in their training data sets, since these would no longer be hidden away where nobody can see them.
