His post: https://discuss.tchncs.de/post/17400082
I’m just convinced all of y’all asking about this are in a huge circle jerk that never ends, but refuses to understand how it all works.
A model is a model. It’s a simplified way of narrowing down thresholds of confidences. It’s a pretty basic sorting algorithm that runs super fast on accelerated hardware.
You people seem to think it’s like fucking magic that steals your soul.
Don’t send information over the wire, and you’re golden. Learn how it works, and stop asking dumb questions like this is all brand new, PLEASE.
There is a difference between a general scare about the AI buzzword and legitimate distrust of online services that are closely connected to American spying institutions (regardless of whether they are AI or not).
If my calorie tracker app appointed a (former) NSA official to its board, I would be looking for alternatives too. This is not about AI; this is about a company with huge sets of private data being closely interconnected with American spy institutions.
Sad that you don’t seem to be able to distinguish between legitimate security questions and badly informed hypes/scares as soon as a buzzword like AI occurs.
Read the last part of my comment again. Seems I very clearly grasp the concern.
I did read this part, and while this is generally true, there are use cases of such large models. Some of them require the input of personal data (find bugs in my code, formalize this email, scan this picture for text and translate it, draw an anime version of this picture of my friend tom)
So people being wary of the security implications of such large models are certainly not
in a huge circle jerk that never ends, but refuses to understand how it all works.
Sure, you can just call them all dumb for using AI like the mainstream does (putting in personal data) and attribute it to an unwillingness to understand, but this doesn’t match reality. Most people don’t even understand how an operating system functions, which components work online and which offline, and who can access which of their information, let alone know how “AI” works and what the security implications are.
So if people ask those questions, hoping there are alternatives they can use safely, your answer of “no, u just dumb, machine can’t harm you, it’s not magic, just don’t put data in” is not only rude but also misses the point. Most useful/fun/mainstream uses DO, in fact, put in data.
You explaining basic models also doesn’t help, as the concern here is not mainly/only the model, but American spy institutions being able to access all the prompts you put in, maybe sorting you into personality clusters based on your use of language or assigning tags for which political stance a user has (and with entities like the NSA I could imagine far worse).
Also, “A model is a model” is not very accurate in such cases. When someone has control and secrecy over every aspect of the model, it would be quite possible for entities like the NSA to manipulate the content the model puts out in arbitrary directions. A government controlling and manipulating the information the public receives is a red flag for a lot of people (rightfully so, IMHO).
How are people supposed to get better at digital privacy topics if you just tell them to shut up and insult them when they ask questions trying to learn? You acting like you are in your ivory tower of genius isn’t helping anyone.
Edward Snowden isn’t god
I know that’s a shock to some…
Anyway I think he means self hosted options. I would recommend Ollama with a frontend
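For anyone who wants to try that, here is a minimal sketch of talking to a locally running Ollama server from Python. It assumes the server is listening on its default local port (11434) and that you have already pulled a model; the model name `llama3` is only an example, and the helper names `build_request` and `ask` are made up for this post. As long as the URL points at localhost, the prompt never goes over the wire.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt, model="llama3"):
    """Build (but do not send) a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt, model="llama3"):
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `ask("Explain open-weight models in one sentence.")` should return the model's reply as a string. A frontend does essentially the same thing with a nicer UI on top.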
There are VERY FEW fully open LLMs. Most are the equivalent of source-available in licensing and at best, they’re only partially open source because they provide you with the pretrained model.
To be fully open source they need to publish both the model and the training data. The importance is being “fully reproducible” in order to make the model trustworthy.
In that vein there’s at least one project that’s turning out great so far:
Fortunately, LLMs don’t really need to be fully open source to get almost all of the benefits of open source. From a safety and security perspective it’s fine because the model weights don’t really do anything; all of the actual work is done by the framework code that’s running them, and if you can trust that due to it being open source you’re 99% of the way there. The LLM model just sits there transforming the input text into the output text.
From a customization standpoint it’s a little worse, but we’re coming up with a lot of neat tricks for retraining and fine-tuning model weights in powerful ways. The most recent big development I’ve heard of is abliteration, a technique that lets you isolate a particular “feature” of an LLM and either enhance it or remove it. The first big use of it is to modify various “censored” LLMs to remove their ability to refuse to comply with instructions, so that all those “safe” and “responsible” AIs like Goody-2 can be turned into something that’s actually useful. A more fun example is MopeyMule, a LLaMA3 model that has had all of his hope and joy abliterated.
So I’m willing to accept open-weight models as being “nearly as good” as a full-blown open source model. I’d like to see full-blown open source models develop more, sure, but I’m not terribly concerned about having to rely on an open-weight model to make an AI system work for the immediate term.
Is abliteration based off the research by the Anthropic team? When they got Claude to say it was the golden gate bridge?
Ironically, as far as I’m aware it’s based off of research done by some AI decelerationists over on the alignment forum who wanted to show how “unsafe” open models were in the hopes that there’d be regulation imposed to prevent companies from distributing them. They demonstrated that the “refusals” trained into LLMs could be removed with this method, allowing it to answer questions they considered scary.
The open LLM community responded by going “coooool!” And adapting the technique as a general tool for “training” models in various other ways.
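As a rough picture of what “isolating a feature and removing or enhancing it” means: as I understand the write-ups, the feature is treated as a direction in the model’s activation space, and you project that direction out of (or add more of it to) the activations. The sketch below shows only that projection step on toy vectors. The function names and numbers are made up for illustration; real tools work on actual model activations or weight matrices, and finding the direction in the first place is the hard part.

```python
import math

def normalize(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def ablate_direction(activation, direction, scale=1.0):
    """Shift `activation` along `direction`.

    scale=1.0 removes the component along the direction entirely;
    a negative scale amplifies it instead.
    """
    unit = normalize(direction)
    coeff = sum(a * u for a, u in zip(activation, unit))
    return [a - scale * coeff * u for a, u in zip(activation, unit)]

# Toy example: pretend the third axis is the "refusal" feature.
hidden = [1.0, 2.0, 3.0]
feature = [0.0, 0.0, 2.0]
removed = ablate_direction(hidden, feature)             # component removed
boosted = ablate_direction(hidden, feature, scale=-1.0)  # component doubled
```

After removal, the vector has no component left along the feature direction, while everything orthogonal to it is untouched.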
I suppose the importance of the openness of the training data depends on your view of what a model is doing.
If you feel like a model is more like a media file that the model loaders are playing back, where the prompt is more of a type of control over how you access this model, then yes, I suppose from a trustworthiness aspect there’s not much to the model’s training corpus being open.
I see models more in terms of how any other text encoder or serializer would work, if you were, say, manually encoding text. While there is a very low chance of any “malicious code” being executed, the importance is in the fact that you can check the expectations about how your inputs are being encoded against what the provider is telling you.
As an example attack vector, much like a malicious replacement attack on any other download: if I were to download a pre-trained model from what I thought was a reputable source, but was man-in-the-middled and provided with a maliciously trained model, suddenly the system I was relying on that uses that model is compromised in terms of the expected text output. Obviously that exact problem could be fixed with some hash checking, but I hope you see that in some cases even that wouldn’t be enough (such as malicious “official” provenance).
As these models become more prevalent, being able to guarantee integrity will become more and more of an issue.
Even if you trained the AI yourself from scratch you still can’t be confident you know what the AI is going to say under any given circumstance. LLMs have an inherent unpredictability to them. That’s part of their purpose, they’re not databases or search engines.
if I were to download a pre-trained model from what I thought was a reputable source, but was man-in-the-middled and provided with a maliciously trained model
This is a risk for anything you download off the Internet; even source code could be MITMed to give you something with malicious stuff embedded in it. And no, I don’t believe you’d read and comprehend every line of it before you compile and run it. You need to verify checksums.
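To make the checksum point concrete, here is a rough sketch of verifying a downloaded model file against a SHA-256 digest published by the provider. The function names are made up for this post, and the path and expected digest are whatever your provider gives you. Note that this only proves you received the file the provider intended to publish; it says nothing about whether the provider itself is trustworthy.

```python
import hashlib
import hmac

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so multi-gigabyte weights don't need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_download(path, expected_hex):
    """Return True if the file's SHA-256 matches the published hex digest."""
    actual = sha256_of_file(path)
    # Normalize whitespace and case, since published digests vary in formatting.
    return hmac.compare_digest(actual, expected_hex.strip().lower())
```

The digest should come from a different channel than the file itself (for example the project’s signed release notes), otherwise whoever swapped the file could swap the checksum too.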
As I said above, the real security comes from the code that’s running the LLM model. If someone wanted to “listen in” on what you say to the AI, they’d need to compromise that code to have it send your inputs to them. The model itself can’t do that. If someone wanted to have the model delete data or mess with your machine, it would be the execution framework of the model that’s doing that, not the model itself. And so forth.
You can probably come up with edge cases that are more difficult to secure, such as a troubleshooting AI whose literal purpose is messing with your system’s settings and whatnot, but that’s why I said “99% of the way there” in my original comment. There’s always edge cases.
That would be part of what’s required for them to be “open-weight”.
A plain old binary LLM model is somewhat equivalent to compiled object code, so redistributability is the main thing you can “open” about it compared to a “closed” model.
An LLM model is more malleable than compiled object code, though; as I described above, there are various ways you can mutate an LLM model without needing its “source code.” So it’s not exactly equivalent to compiled object code.
The importance is being “fully reproducible” in order to make the model trustworthy.
Well that’s a problem, because even with training data that’s impossible by design.
Not just LLMs but all kinds of models are equivalent to freeware: you get the model itself and the other essential bits for it to work. I wouldn’t even call it source-available, as there is no source.
Take Redis as an example. I can still go grab the source and compile a binary that works. This doesn’t apply to ML models.
Of course one can argue the training process isn’t deterministic, so even with the exact training corpus it can’t create the same model, bit for bit, on multiple runs. However, I would argue the same corpus provides the chance to train a model of similar or equivalent performance. Hence the openness of the training corpus is an absolute requirement for a model to qualify as FOSS.
I’ve seen this said multiple times, but I’m not sure where the idea that model training is inherently non-deterministic is coming from. I’ve trained a few very tiny models deterministically before…
Are you sure you can train a model deterministically, down to every bit? As in, feeding the results into sha256sum will yield the same hash?
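One way to check this for yourself on a toy problem: the sketch below trains a tiny linear model twice with the same seed on a single CPU thread, serializes the weights, and hashes them. Everything here (the function names, the synthetic data, the hyperparameters) is made up for illustration. It shows that seeded training can be bit-reproducible in the simple case; it is not a claim about large GPU training runs, where parallel floating-point reductions and some kernels can change the low-order bits unless the framework is explicitly configured for determinism.

```python
import hashlib
import random
import struct

def train_tiny_model(seed, steps=200, lr=0.05):
    """Fit y = w*x + b with plain SGD, using one seeded RNG for everything."""
    rng = random.Random(seed)
    # Synthetic data drawn from y = 3x + 1 plus a little noise.
    data = []
    for _ in range(32):
        x = rng.uniform(-1, 1)
        data.append((x, 3 * x + 1 + rng.gauss(0, 0.1)))
    w, b = rng.uniform(-1, 1), 0.0
    for _ in range(steps):
        x, y = rng.choice(data)
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    return w, b

def weights_hash(weights):
    """Serialize weights as little-endian float64 and return their SHA-256."""
    raw = struct.pack(f"<{len(weights)}d", *weights)
    return hashlib.sha256(raw).hexdigest()

run_a = weights_hash(train_tiny_model(seed=42))
run_b = weights_hash(train_tiny_model(seed=42))
print(run_a == run_b)  # same seed, same machine, same code path
```

Changing the seed, the data order, or the arithmetic (different hardware, different library versions) is where the hashes start to diverge.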
What’s FOSS-AI? A model everyone can download and use for free? Or, in the OSS spirit, that everything needs to be open and without discrimination of use, i.e. an OSS training data corpus and no AUP attached?
Or you mean the inference engine running those models?
Everything which is not BigTech. Preferably FOSS, at least not BigTech, just alternatives to for example OpenAI.
So you’re including free models like freeware, not FOSS only, by non big tech.
Your choice of models will be quite limited, as the compute resources and training corpus needed to make a viable base model aren’t something just anyone can get.
Agreed, but there have been big projects that have been open source. I can imagine* an AI (LLM) being developed fully FOSS. It would be rare, but I can see it happening if a big foundation got behind it. Maybe Mozilla, or another that tries to keep the spirit of their mission statement.
*Imagine: I’m not too familiar with all of the current, public, and free models out there, just a few. This was just me making a hopeful guess about whether it might actually be happening now.