Some thoughts on real open source Artificial Intelligence

We are in the midst of a hype cycle around Artificial Intelligence (AI), and companies are trying to get ahead of each other in all sorts of ways. One way is to claim that their AI is open source. So far, there has been a lot of open washing, meaning that they claim to be open but fail to apply common practices, like using licenses approved by the Open Source Initiative (OSI), to make that clear. This is so common that some weekly newsletters even have recurring segments listing all perpetrators.

Adding to this, there is an ongoing discussion about what open source for AI should mean, and OSI is even drafting a new definition for this. As late as today, at the OSPOs for Good conference, some big companies tried to claim that there is a gradient of open source, and threatened not to be supportive of open source at all if that view is not accepted. That seems just disingenuous to me. Through various venues, webinars and chats, I have tried to make a point that seems obvious to me, one that leans back on the original four software freedoms. So before going further, let’s just remind ourselves of these.

The four freedoms of free software

  • The freedom to run the program as you wish, for any purpose (freedom 0).
  • The freedom to study how the program works, and change it so it does your computing as you wish (freedom 1). Access to the source code is a precondition for this.
  • The freedom to redistribute copies so you can help your neighbor (freedom 2).
  • The freedom to distribute copies of your modified versions to others (freedom 3). By doing this you can give the whole community a chance to benefit from your changes. Access to the source code is a precondition for this.

These freedoms are not a gradient. All four are needed, or it is not free and open source software.

It might be worth mentioning that OSI’s definition of open source is quite aligned with the spirit of these, and in some of their points even clearer. For example, their second point about source code includes the following clarification:

The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.

The Open Source Definition – Source Code

These two definitions have led me to the following arguments.

My arguments about the freedoms

In various venues, I have had arguments essentially like the one below.

Me: So to be able to study the AI and to modify it in any way, I need access to the data that the models have been trained on. If I can’t see, remove, add, or modify the data, I don’t have full freedom to study the AI or change it to suit me. Essentially, it is a black box, and it doesn’t matter if the weights are free because I can’t change what is weighted.

Them: But we cannot open source the data because we don’t own the copyright of the data we trained the models on.

Me: …

So they openly admit that the system as a whole is not free.

In my opinion, such AI systems are by definition not, and cannot be, viewed as free and open source software.

Another similar discussion starts with me doing the same rant, but continues in a slightly different way.

Them: But we cannot publish the data because even though we made sure we own the copyright of the data, it includes private data and personal information.

Me: …

Now, I have to give it to them that we are a bit closer. But I still don’t have freedom 1: I cannot, for example, clean the data of biased or erroneous private data to make the model better. There is also a risk that distributing the model would give others access to this private data if I change other parts of the system, and the risk of breaking laws that way means I don’t really have freedom 3 either.

Where will we end up?

In conclusion, the essence of truly open source AI lies not only in the accessibility of the code or the weights but also in the freedom to access, modify, and distribute the data upon which these models are built. Without freely licensed and fully accessible training data, the promises of transparency, collaboration, and improvement inherent in the open source ethos remain unfulfilled. The four freedoms that define open source software are undermined when data remains proprietary or restricted. As the discourse around open source AI continues to evolve, it is imperative that we push for these clear values of freedom and openness, or we risk heading into a future where we rely on black boxes and blind trust, with reduced possibilities to shape the software as we wish.

In addition, we should perhaps find a term that can be used for other types of AI systems. Reusable, gratis, or shareware might be terms to use for that. If you have ideas, please let me know.
