What the AI revolution means for Open Source and our society
Formal Metadata

Title: What the AI revolution means for Open Source and our society
Series: FOSS Backstage 2024 (talk 36 of 43)
License: CC Attribution 3.0 Unported. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
DOI: 10.5446/67111
Language: English
Transcript: English (auto-generated)
00:07
First of all, thanks a lot for the invitation. I think it's one of my favorite events here. It's the right size, not too big, not too small, and very good for having conversations. Hopefully, after my talk, there will still be the opportunity to discuss the details with all of you.
00:24
So, the topic of my talk is artificial intelligence and what it means for open source and for Nextcloud specifically. First, a little bit about myself. I have been doing open source, I think, since the term open source has existed.
00:41
25 years, yes, I'm old, I know. I have worked in lots of different roles in the KDE community, was involved in a number of other projects, helped a little bit with the W3C, helped to develop the ActivityPub standard, and I am working with the United Nations on their open source strategy.
01:02
I'm doing a lot of lobbying nowadays for open source on the European level, but I think I'm mostly invited here as the founder and CEO of Nextcloud. So, what motivated me to start this whole thing, over 10 years ago, was basically this.
01:23
So, it was clear that we were moving to a future where five big companies control all the data, all communication, all collaboration, everything online. This was when all these web services, like Dropbox and Gmail and others, popped up.
01:40
For me, this was a dystopia. I don't want to live in a world where everything we do is controlled by a handful of closed-source, proprietary companies from the other side of the planet. So, this is why I founded Nextcloud, to be the alternative that is completely open source and can be self-hosted,
02:02
and because of that, it's privacy-friendly. So, a little bit about Nextcloud. Who in the audience knows Nextcloud? Okay, I can keep the introduction short. What we do is split into four different core parts.
02:22
Nextcloud Files for file sync and share, Talk for chat and video conferencing, Groupware for mail, calendar, and contacts, and Office for editing office documents, everything under the umbrella of Nextcloud Hub. So, this is structured really similarly to our competitors from Microsoft and Google. They have similar components, and this is also what we do.
02:42
As I said, the main difference is that we are fully open source, 100%, not open core or anything, 100% open source, and because of that it runs on-premise. With on-premise, I mean you can really install it on a small Raspberry Pi for yourself, for your friends and family; that's totally cool.
03:01
But it can also run for a lot more users. The biggest installation we have has 20 million users, so it really scales from very small to very big. That's a big service provider in South America. But it's also used in Europe; for example, the MagentaCloud from the German Telekom is based on Nextcloud, with 3.6 million users,
03:22
and it's basically the same software. I think it's quite interesting that it really runs both on a Raspberry Pi and on a huge Kubernetes cluster. So, this is what we do. A little bit about the components themselves. Most of you know them. This is the interface for the file sync and share part.
03:40
It's just for managing your files and documents. There's Nextcloud Talk for chat and video conferencing. This is obviously the chat UI, where you can press a button and turn it into a call. Again, everything is self-hosted; not a bit is leaving your network if you don't want it to. Then there's a calendar, of course, with all the sharing features,
04:00
synchronized with your phone, everything you expect. A mail client with all the usual features, quite powerful; it can basically do most of the things that Gmail can do. And there's an office component for editing your office documents. So, this is what we do.
04:20
We had a full alternative to big tech companies. Life was good a year ago. And then came AI. And to be honest, I got a little bit of a depression at the time because of two problems. First of all, can we even compete anymore?
04:41
Can we, as an open source community with just a few people, compete with these big tech companies? Is there a way to really do something similar to what they are doing? There's also the infrastructure, right? As I said, it should run on-premise on small devices, Raspberry Pis at home.
05:01
At the time, it was all over the press that you need gigantic server clusters to do AI. Is it even possible to self-host this? That was the one question: can we do it? The second question is: should we do it? Because, as you might know, the whole AI thing comes with a lot of questions and challenges. Obviously, it's a tool that can be very useful.
05:24
It can be your assistant, your helper, someone who really takes away boring things, can automate things, and it can really help you to be more productive, basically. But, obviously, there's also the dark side of it.
05:41
There's also all the problems. I will cover a few of them. And the question is then, okay, we as Nextcloud, as an ethical organization, we want to bring software to the world that helps people. So, should we even do this with AI? Or are the negative sides too negative, basically? And we are not the only ones with these concerns.
06:02
I mean, this is just a bit from the press. There are companies like Apple and Samsung and Goldman Sachs and many others that are blocking the use of ChatGPT. You might wonder, okay, why is that? First of all, they are concerned about misinformation: they might get answers that are not true,
06:20
which ends up harming them. And the second is the whole question around privacy. Because, obviously, if you copy your contracts and whatever documents into ChatGPT, well, they have the information. That alone is not fundamentally different from other cloud services. The fundamental difference is that this data is then used for training the model.
06:40
So, it's absolutely possible that if someone from, let's say, Airbus is discussing some construction plans in ChatGPT, then later another person from Boeing comes along, asks a question, and gets answers that are based on the information of the competitor. And this is, of course, a risk that cannot be calculated. This is why a lot of companies ban the use of these public AI services.
07:06
And this led us to a lot of discussions internally and also with some experts. And so, a little bit over a year ago, we did two things. First of all, we founded a dedicated team to work on these AI features.
07:20
That was because we wanted to try whether it's possible to compete. And the second is that we started an initiative that we called Ethical AI at the time, to really question what we are doing and whether this is really ethical and helpful. Obviously, ethical, that's a big word.
07:40
A lot of people understand different things under the term ethical. So, it is important to really structure this a little bit more. We can't just say, well, we are the good ones. Okay, sure, but why? What's different? So, we really have to go into a little bit more detail. What we did is, we looked into potential problems. And these were not hard to find, because it was all over the press,
08:03
or it's still all over the press. And we identified a few critical questions about AI. The first question is the one around discrimination. Because as you know, these large language models are trained on information from the internet, and they give back information based on the internet.
08:22
And obviously, as you know, the internet is full of discrimination, so these AI systems also give you discriminatory answers. If you ask an image generation tool, hey, can you please generate a photo of a doctor, you will probably get a photo of a white man, obviously,
08:43
because that's what is found on the internet. That's obviously a problem; we want to do something better here. The second question is the CO2 footprint, because, as I mentioned, gigantic server farms with GPUs are needed, and they consume a lot of power.
09:01
And the question is, what are we doing here? Are we, because AI is now cool, accelerating climate change here? What is really the consequence of our actions? The next point is the privacy question. I already mentioned it. There's data leakage, not only through the normal leakage of documents
09:23
on some cloud services, but through training data, which is quite interesting. It's a completely new dimension, which is not covered by normal laws; the leaking is indirect. And then, of course, there's the last question: is this even available for everybody?
09:42
Because one of the strong points of open source that I really believe in is that we are building and creating software that can be used by everybody on the planet, right? Even if you're in the global south somewhere and you don't have so many resources, you might not have the money to buy a Microsoft subscription or an OpenAI subscription or something.
10:02
We want to provide technology that's usable by everybody, even without money, obviously. And how is this with AI? Is this only going to be a paid feature in the future? Or how does this work? So, this led us, after a lot of discussion, to the ethical AI framework that we developed.
10:24
It works like this: we developed a traffic light system from red, which is not so ethical, to green, which is ethical by our definition. And of course, the question is, how do you measure that? So, we identified three requirements for how we measure that.
10:42
The first is that the source code should be open source. Why is this important? First of all, if the source code is open source, you can run it yourself. That's very important. And second, you can then also measure the energy consumption, because with just a SaaS service on the internet you cannot measure anything.
11:00
You just use it. If you ask Google or Microsoft, hey, how much power do you consume? You don't get an answer. And they probably don't really care because the money in their case comes from somewhere else. This is all subsidized from other channels. So, the energy consumption can only be measured if it's open source and you can also then optimize it.
11:20
And there are lots of examples; tools like PyTorch, for example, get better and better and more efficient with every release, and this is only possible because they are open source and transparent. Second, the trained models should be freely available, because only if they are available can you take them and run them locally. And then you can make sure that no data is leaking anywhere else.
11:44
If you look at GPT-3 or GPT-4, obviously the model is not available. It just exists on the infrastructure of OpenAI. You can access it through a web service, but you cannot really take it and run it locally. So, running it locally is very important.
12:01
And the third requirement is that the training data should be available, because only if you can look into the training data can you check for biases, and if you find problems, you can actually fix them and improve it. Otherwise, it's just a black box: there is some intelligence in there, but you don't know what the basis of it is.
12:22
So, we have these three requirements, and based on these three requirements, we give these traffic light symbols. Basically, if none is available, it's red. This would be something like the Gemini stuff from Google or the ChatGPT stuff, where none of the requirements is met.
12:40
All the way to green, where everything is checked. These are the more ethical tools, and I will talk about them in a second. And exactly as I said, we show those traffic lights to our users when they install different components, because in Nextcloud we actually have integrations with ChatGPT
13:03
or DALL-E if you really want to. But then on the install page, you see the red traffic light. You see what you're doing, and hopefully you understand what it means. It's your decision. But obviously, we provide the more ethical solutions too.
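To make the rubric concrete, here is a minimal sketch of how the three requirements could be turned into a traffic-light rating. This is only an illustration of the idea, not Nextcloud's actual rating code, and the field names and intermediate levels are assumptions made up for the example.

```python
from dataclasses import dataclass

@dataclass
class AIModelInfo:
    # The three transparency requirements from the Ethical AI framework.
    open_source_code: bool         # inference/training code is open source
    model_freely_available: bool   # weights can be downloaded and run locally
    training_data_available: bool  # training data can be inspected for bias

def ethical_rating(info: AIModelInfo) -> str:
    """Map the three requirements onto a traffic-light rating (simplified:
    green if all requirements are met, red if none are, shades in between)."""
    met = sum([info.open_source_code,
               info.model_freely_available,
               info.training_data_available])
    return {0: "red", 1: "orange", 2: "yellow", 3: "green"}[met]

# A hosted proprietary model meets none of the requirements.
print(ethical_rating(AIModelInfo(False, False, False)))  # -> red
# A fully open model with published training data meets all of them.
print(ethical_rating(AIModelInfo(True, True, True)))     # -> green
```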
13:21
So, this is an example of the settings page. In every case, like here with the translation system, you can choose. You can, for example, say, hey, I want to use the ChatGPT API for translating documents in Nextcloud. Or you can say, no, I want to use the Opus model from the University of Helsinki, which runs completely locally
13:41
and is, according to our definition, completely ethical, completely open source and transparent, and you can host it yourself, even on a Raspberry Pi. The same goes for other things, for example our speech-to-text system. Again, you can use the Whisper service from OpenAI, or you can use something that runs completely locally
14:00
and no data is leaking anywhere else. So, this is something where we always give the options so that people can decide what to do. But as I said, with our internal team we focus on developing only the open source and local AI systems. Again, if you want to, you can use something else.
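As an illustration of what running the Opus model from the University of Helsinki locally can look like in practice, here is a minimal sketch using the Hugging Face transformers library and one of the Opus-MT models. This is not Nextcloud's integration code, just a sketch of the underlying idea, and the specific model name is one example of the family.

```python
# pip install transformers sentencepiece torch
from transformers import pipeline

# The model is downloaded once and then runs entirely on your own hardware;
# no text leaves the machine after that.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

result = translator("Nextcloud läuft komplett auf eigener Hardware.")
print(result[0]["translation_text"])
```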
14:20
But we put our energy into the ethical area here. So, now I want to give you some examples of what we actually do there. There are a number of AI features that we developed already a while ago. For example, we have face and object recognition in our Nextcloud Photos area.
14:42
So, as you know, you can synchronize your photos with Nextcloud, and it can detect whether a photo shows a cat or a mouse or a bicycle or whatever. You can search for it, which is very nice, similar to what you know from Apple and Google and other systems. But again, this runs completely locally; no data is sent anywhere else. The same with face recognition, where you can group your family members
15:02
nicely together automatically. But again, no data is leaking anywhere else; everything stays on the device. And the same with some other features, like related resources, recommended shares, and many other things. These are features we already developed roughly two years ago. But as I said, about a year ago we really stepped up the effort
15:23
and said, okay, let's see what we can do. Is there a way to do some features in an ethical way, in an open source way, completely locally? So, some examples. There is the Nextcloud Assistant that we developed. It uses a large language model and runs 100% open source.
15:42
And this is completely self-hosted. I need to speed up a bit here, sorry for that. It can do lots of things. First of all, it has the standard prompt: you can ask any questions and you get some answers, just as you expect. But the really interesting thing is the integration into the applications of Nextcloud.
16:03
For example, in Nextcloud Text. Let's say you have a text document that you're working on. You can mark any text, an Assistant pop-up comes up, and you can say: generate a headline, or summarize, or translate into another language. And it just does it; in this case, it generates a headline.
16:21
And you can just copy it and insert it. So, this is very helpful for working with text. As I said, you can summarize it, make it longer or shorter, or change the wording. That's a very useful feature. Then Nextcloud Mail. We have an integration where you have the priority inbox on the left side,
16:42
which shows you the important mails. And here's a feature where you can summarize email threads. So, maybe you get long email threads; you can get a nice summary of what is being discussed if you don't have the time to read everything that your friends and colleagues are sending you. Then, of course, there's a way to write emails.
17:02
You can say, hey, I want a mail, an invitation to this event, and it can generate a mail for you. Again, everything I'm showing you runs completely locally. From my understanding, we are the only ones who can do this completely locally as open source. Every other tool, like Gmail and Outlook and the others, uses the cloud for that.
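As a rough illustration of the kind of local text processing behind features like thread summarization, here is a minimal sketch using a publicly available summarization model from Hugging Face. The model shown is a generic example, not necessarily what Nextcloud ships, and the email thread is made up; the point is only that summarization can run entirely on your own hardware.

```python
# pip install transformers torch
from transformers import pipeline

# Load a public summarization model; after the initial download,
# everything runs locally and no text is sent to a third-party service.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

email_thread = """Alice: Can we move the release to Friday? The translations are not done yet.
Bob: Friday works for me, but we still need the changelog and the release notes.
Carol: I'll draft the changelog by Thursday and send it around for review."""

summary = summarizer(email_thread, max_length=60, min_length=15, do_sample=False)
print(summary[0]["summary_text"])
```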
17:21
Then, in Nextcloud Talk, we have features where we can automatically translate chat messages from one language to another, to talk with colleagues. At Nextcloud, we have colleagues in 22 countries, so we're very international; you can automatically translate messages, which is very useful. Then, you can also generate some images and post them directly into a chat.
17:42
Every talk needs a cat picture, of course. So, here's the cat picture; it's generated, and you can use it directly in your chat conversation if you want. Then, the assistant is also directly available here. You can, in the middle of a discussion, say, hey, assistant, what are the five most important things, blah, blah, blah, and it gives you some output.
18:01
It's basically like a real assistant that you can trigger at any time in a chat conversation. Then, there is a dictation system, for when you want to dictate emails or text messages, using Whisper. Again, completely local and open source. In video calls in Nextcloud Talk, there's, of course, the option to record the call.
18:22
There's also the option to generate a transcript of the call, even with a summary. So, after every call, you can get a transcript, and we can even identify to-dos in the call and assign them to people just by talking about them. It's actually quite cool, and kudos to the Whisper project here, of course (a small usage sketch follows below). And there are some other features, like classifying documents.
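For reference, here is a minimal sketch of local speech-to-text with the open-source Whisper package, the kind of building block the dictation and call-transcript features rely on. It is an illustration, not Nextcloud's actual integration, and the audio file name is a placeholder.

```python
# pip install openai-whisper   (also requires ffmpeg to be installed)
import whisper

# Load a small Whisper model; it is downloaded once and then runs fully locally.
model = whisper.load_model("base")

# "call_recording.wav" stands in for a recorded Talk call or a dictation clip.
result = model.transcribe("call_recording.wav")
print(result["text"])
```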
18:42
We can detect, oh, this is a document that contains social security numbers or other sensitive things. So, these are all features we developed in the last 12 months, and I'm actually quite happy with what we were able to do. And I want to give you a little preview of what comes out in two weeks,
19:01
because what we did so far was mostly catching up, but I think we can also do some innovative things. The first thing we release in two weeks is something that we call Context Chat, and this is actually quite innovative. It works like this: all the content in your Nextcloud,
19:20
like chat messages, email, and documents, is indexed and stored in a vector database. That's a type of database that works in a multi-dimensional way and can build relations between different tokens. And there's an open source tool called LangChain that we integrated, and this means that you have an assistant that knows your data.
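To make the idea concrete, here is a heavily simplified sketch of the retrieval step behind such an assistant: content is embedded into vectors, stored, and the most relevant pieces are looked up for a question before being handed to a locally hosted language model as context. This is an illustration of the general approach using the sentence-transformers library, not Nextcloud's actual LangChain-based pipeline, and the model name and documents are just examples.

```python
# pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

# A small local embedding model; downloaded once, then runs on-device.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Stand-ins for indexed Nextcloud content (mails, chat messages, documents).
documents = [
    "Ross: the venue for the community event is booked for 14 June.",
    "Minutes: the budget for the event was approved last Tuesday.",
    "Chat: the Raspberry Pi demo needs a bit more storage for the index.",
]
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def retrieve(question: str, top_k: int = 2) -> list[str]:
    """Return the stored snippets most similar to the question (cosine similarity)."""
    q = embedder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vectors @ q  # vectors are normalized, so dot product = cosine
    best = np.argsort(scores)[::-1][:top_k]
    return [documents[i] for i in best]

# The retrieved snippets would then be passed to a local LLM as context for the answer.
print(retrieve("What did Ross say about the event?"))
```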
19:42
So, you can ask the assistant, hey, can you please summarize all the emails from Ross from the last week? And because the assistant knows all my emails, it can give you a summary of what was in the mail from last week. Or you can say, hey, this is a project management board with all the different to-dos,
20:01
can you please summarize how we typically organize an event? And because the assistant knows all the tickets and can actually understand the relations between the tokens, it can give you a real answer. From my understanding, this is quite innovative. Microsoft has something called Business Chat; they have that, but of course in the cloud.
20:21
Google announced something like that, but I think it's not production-ready yet. But this we can actually do here, completely locally. And this also runs on your Raspberry Pi if you want; you need a little bit more storage because of the big index, but it works fine, completely locally. And there are some other things we are doing. For example, there is a feature called Context Writer.
20:42
So we can generate a document that, from a format perspective, is based on another document. For example, you can say, hey, can you please write me a blog post about this or that topic, but very similar to this existing blog post? It can use that as input and generate nice text.
21:00
And there are other things, for example summarization of Talk messages. So if you go into a chat room and you just don't have time to read what all your colleagues and friends are discussing, you can say, hey, give me a summary, and you get a summary of the history. And there are other things, like suggested email answers, where you can say, okay, give me a possible answer to this email.
21:21
Again, completely local and open source. So the summary is, I think I've overcome my depression from a year ago a bit, because there is a lot we can do. It's really amazing. The open source community is amazing. I don't know how many of you know the Hugging Face community. It's a gigantic community.
21:42
They're constantly working on these models and all the tools, and some of the things are really more advanced than what comes from the commercial companies. So there's a lot we can do, and we can do everything locally; there's no need for the huge clouds here. And I think with our ethical system,
22:01
we can also put this a bit into perspective with the different ethical AI ratings, to say, okay, maybe this one is a bit critical from a privacy perspective, but this one is not, because it runs completely locally, is transparent, and there's no bias in it. So for me, this is the positive ending. Again, it shows the power of the open source community,
22:23
that we are actually not surpassed by big tech companies, but there's still something we can do which is provide nice privacy-friendly tools to everybody. Thanks a lot.
22:42
Any questions for Frank? Yes. Hello. Thank you very much. It's very interesting. I have one comment, if I may, and then one question. I think on the ethical side, it's all very interesting, your three criteria, but it's not guaranteed that it's ethical, right?
23:01
Because data may be available but still biased, and you may know the source or how it works but still do bad things with your model. So I guess there is also a risk that people see a green light and think, okay, that's fine. But actually, you have to look yourself, because there are many data sets that have been used that are well known, and we still discover that there are a lot of issues
23:22
because nobody looked. So there is this problem where people didn't look closely into the data, although the data sets were already known. My question is: your assistant, the LLM, is it one that you trained yourself, or did you use an existing one, or how did you approach the LLM aspect?
23:41
So first of all, to the first comment, I mean, I completely agree. And that's what I tried to say earlier, maybe not clear enough. Ethical is a big topic, and I'm not saying this is all like super ethical according to every definition. But I think that transparency is important. If the source code is transparent, you can see what's happening. If the training data is transparent, you see what's in there.
24:03
I think this is the prerequisite to make sure that the system is good. Obviously, it is possible to have a transparent system that is still shitty, of course, I know. But at least then we have a way to see it and to improve it. And to the second question: we're not in a position to train our own foundation models.
24:23
This indeed requires a lot of resources. But there are actually quite a few models available. There's the Llama model from Meta, although there the training data is not transparent. But there are also other ones, like the Falcon model or the Mistral model from France. So there are actually some available.
24:41
And what we do is try to fine-tune them. For example, there is a collaboration we do in Germany with the state of Schleswig-Holstein, where they provide training data, and through a process called fine-tuning we can build a model that is specialized for this use case. So we're not in a position to build our own foundation models, but we can fine-tune existing ones.
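For readers curious what fine-tuning an existing open model can look like in practice, here is a minimal, hedged sketch using the Hugging Face transformers and peft libraries to attach LoRA adapters to a base model. The model name, target modules, and hyperparameters are illustrative assumptions, not details from the talk, and the actual training loop over the domain data is omitted.

```python
# pip install transformers peft torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Placeholder for any open foundation model (Falcon, Mistral, Llama, ...).
base_model_name = "mistralai/Mistral-7B-v0.1"  # assumption: example model id
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

# LoRA adds a small number of trainable parameters on top of the frozen base
# model, which is what makes domain-specific fine-tuning affordable.
lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # assumption: attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters

# From here, a standard training loop (e.g. transformers.Trainer) would run over
# the domain-specific data, such as material provided by a public-sector partner.
```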
25:05
Frank, thank you for the talk. Really interesting to see you pushing forward, especially with running AI locally, to preserve privacy and let companies and individuals take control over their data.
25:23
Regarding your requirement of allowing people to see the training data: I work with the OSI. We're working on the Open Source AI Definition, and we're hosting a lot of discussions around that. And that's a big topic, around the training data
25:41
and the data in general. There are a lot of issues; it's not as simple as that. There is a lot of very sensitive data used for training those models, and copyrightable work as well. So sometimes, maybe even often,
26:03
the ethical choice is not to provide the training data in full, because imagine medical records or anything that has sensitive data. Even though that training data can be valuable,
26:22
sometimes you have to transform it into synthetic data, and hide or anonymize that data, to make it valuable and to preserve privacy. So you might want to review that requirement, and I invite you and everyone here who wants to join those discussions at discuss.opensource.org
26:44
and think about those questions. Yeah, I had no time to mention that, but I really like that the OSI is working on an official open-source AI definition. I was part of some workshops there already, and I'm really happy that it actually goes in a similar direction.
27:02
The question of training data is of course complicated. As you said, there is the aspect of privacy. I think this one is relatively easy, because if you, let's say, take medical data that contains the names of the patients, you might hope that the names are not exposed, but it is possible that they are.
27:25
So I think it's pretty clear that from a privacy perspective, the training data needs to be anonymized. From a copyright perspective, it's more complicated. I think this is still a process. I mean, there's a lawsuit in the US, several lawsuits.
27:41
The New York Times is suing OpenAI at the moment about exactly that. Whether it is a copyright violation to take proprietary data for the training process is an ongoing discussion. We have, for example, a feature that I had no time to mention: we can also classify music in Nextcloud. You can upload your MP3s or whatever,
28:02
and then you can structure them into different genres: this is classical music, this is rock music, and so on. And for that, the model is not from us; it comes from somewhere else. It is, for example, trained using music from Spotify, which obviously is copyrighted. And it is a big question whether this is a copyright violation or not.
28:21
But I think this is something for the legislation to figure out. Alright, thank you Frank. Thanks a lot.