Some thoughts on real open source Artificial Intelligence

We are in the midst of a hype around Artificial Intelligence (AI), and companies are trying to get ahead of each other in all sorts of ways. One way is to claim that their AI is open source. So far, there has been a lot of open washing, meaning that they claim to be open but fail to apply common practices, like using licenses approved by the Open Source Initiative (OSI), to make that clear. This is so common that some weekly newsletters even have recurring segments listing all perpetrators.

Adding to this, there is an ongoing discussion about what open source for AI should mean, and OSI is even drafting a new definition for this. As late as today, at the OSPOs for Good conference, some big companies tried to claim that there is a gradient of open source, and threatened to stop supporting open source altogether if that view is not accepted. That seems just disingenuous to me. Through various venues, webinars and chats, I have tried to make a point that seems obvious to me, and which leans on the original four software freedoms. So before going further, let’s just remind ourselves of these.

The four freedoms of free software

  • The freedom to run the program as you wish, for any purpose (freedom 0).
  • The freedom to study how the program works, and change it so it does your computing as you wish (freedom 1). Access to the source code is a precondition for this.
  • The freedom to redistribute copies so you can help your neighbor (freedom 2).
  • The freedom to distribute copies of your modified versions to others (freedom 3). By doing this you can give the whole community a chance to benefit from your changes. Access to the source code is a precondition for this.

These freedoms are not a gradient. All four are needed, or it is not free and open source software.

It might be worth mentioning that OSI’s definition of open source is quite aligned with the spirit of these, and in some of their points even clearer. For example, their second point about source code includes the following clarification:

The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.

The Open Source Definition – Source Code

These two definitions have led me to the following arguments.

My arguments about the freedoms

In various venues, I have had arguments essentially like the one below.

Me: So to be able to study the AI and to modify it in any way, I need access to the data that the models have been trained on. If I can’t see, remove, add, or modify the data, I don’t have full freedom to study the AI or change it to suit me. Essentially, it is a black box, and it doesn’t matter if the weights are free because I can’t change what is weighted.

Them: But we cannot open source the data because we don’t own the copyright of the data we trained the models on.

Me: …

So they openly admit that the system as a whole is not free.

In my opinion, such AI systems are by definition not, and cannot be, viewed as free and open source software.

Another similar discussion starts with me doing the same rant, but continues in a slightly different way.

Them: But we cannot publish the data because even though we made sure we own the copyright of the data, it includes private data and personal information.

Me: …

Now, I have to give it to them that we are a bit closer. But I still don’t have freedom 1. For example, I cannot clean the data of biased or erroneous private data to make the model better. There is also a risk that distributing the model would give others access to this private data if I change other parts of the system, and the risk of me breaking laws by doing so means I don’t really have freedom 3 either.

Where will we end up?

In conclusion, the essence of truly open source AI lies not only in the accessibility of the code or the weights but also in the freedom to access, modify, and distribute the data upon which these models are built. Without freely licensed and fully accessible training data, the promises of transparency, collaboration, and improvement inherent in the open source ethos remain unfulfilled. The four freedoms that define open source software are undermined when data remains proprietary or restricted. As the discourse around open source AI continues to evolve, it is imperative that we push for these clear values of freedom and openness, or we risk heading into a future where we rely on black boxes and blind trust, with reduced possibilities to shape the software as we wish.

In addition, we should perhaps find a term that can be used for other types of AI systems. Reusable, gratis, or shareware might be terms to use for that. If you have ideas, please let me know.

Mentor at Hack for Sweden

Hack for Sweden is a national hackathon, run by Swedish government agencies. For the past few years, DIGG has been responsible for organizing it. In 2015 I took part as a data owner from Wikimedia Sverige, in 2016 I took part as a hacker myself, and in 2017 I helped out as a data owner again, but this time on behalf of Riksantikvarieämbetet.

In this year's edition of Hack for Sweden I am taking part in a new role, namely as a mentor for open source. I hope I can help out with how to work with it. I suspect that any questions will focus on licenses, but I will try to shift the focus from merely being open on paper to adopting a way of working that is permeated by the principles behind open source. After all, that is what I do day to day at the Foundation for Public Code.

My talk at Open Source Summit India 2021

A couple of weeks ago I had the honor to give a talk at Open Source Summit India 2021. I titled it "Real world collaboration through the Standard for Public Code", and it was a lightning talk about just that.

My talk, starting 3:58:58.

Here are the slides I used.

A standard for public code

I recently started working at the Foundation for Public Code, and it has been a fantastic start in a wonderfully idea-driven organization. In many ways it reminds me of my time at Wikimedia Sverige, not least through its great transparency. We have no intranet, but instead publish all our processes publicly on a website, and most of the work happens on GitHub.

One of the things we do is a standard for public code. It is a collection of minimum requirements for what you actually need to do for the things you make as open source to be reusable not just in theory, but also in practice. The standard itself is of course licensed under CC0.

Today we released version 0.1.4, and even though it was only a minor improvement, it was a big deal for me, because now I am listed among the co-authors!