Hey, your LLM is leaking... | CISA reporting platform
LLM Low Code and Vector Database Leaks | How to report your breaches to benefit others
It’s leaking data
The rush for AI integration leads to the use of ‘Low Code’ frameworks that leak data, and to exposed vector databases that can be reverse engineered to recover the underlying data
There are two parts to this story about how everyone is going about implementing AI in their setup.
How can my LLM leak data?
A common method to get an LLM to leak data is prompt injection. Reams have been written on that, and there are well-known prompt injection tools and databases. Check out Giskard, if you are interested in a good tool.
Oops, we were off on a tangent there!
Coming back: this week it is about reversing vector databases.
To understand this risk, we will have to understand the following terms in the GenAI context:
RAG (Retrieval Augmented Generation): Simply put, you feed your own data to a pre-trained AI model and ask it to reply in the context of that data.
How do you feed your own data? Using vector databases.
Vector Databases: Again, to the chagrin of the purists, a vector database is essentially a store of numbers (these numbers are vectorised forms of the data you have uploaded). When you upload the Harry Potter series of books for your AI to answer from, the books get split into smaller parts. You create an ‘embedding’ of each part. This embedding is nothing but a large series of numbers that only your AI will understand. Now, you store these numbers (embeddings) in a database and call it a ‘vector database’.
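The chunk-embed-store pipeline described above can be sketched in a few lines. This is a minimal illustration only: the `embed` function below is a toy stand-in (a hashed bag-of-words vector), not a real embedding model such as those offered by OpenAI or sentence-transformers, and the "database" is just a Python list.

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    # Toy stand-in for a real embedding model: hash each word into
    # one of `dim` buckets and L2-normalise the count vector.
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.sha256(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def chunk(document: str, size: int = 40) -> list[str]:
    # Naive fixed-size chunking; real pipelines split on sentences or tokens.
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# The "vector database": each embedding stored alongside its source chunk.
document = "Harry Potter's snowy owl was named Hedwig. " * 10
store = [(embed(c), c) for c in chunk(document)]
```

Note that a production store usually keeps the original chunk text right next to its vector, which is exactly why an exposed database is so valuable to an attacker.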
To the AI-novice cyber security professional, this looks like a hash: something that can be computed in one direction, but cannot be reversed. And, for a long time, it was also thought of as a hash. However, life is not that simple for any cyber professional, right? So we have a nice little way in which we can reverse engineer this vector database into some form of the original data, so you can see (in human-readable format) that Harry’s owl was named Hedwig.
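A crude way to see why embeddings are not hashes: if an attacker holding a leaked vector can also query the same embedding model (many are public APIs), they can score candidate texts against the leaked vector and keep the closest match. The sketch below uses a toy hashed bag-of-words embedding as a stand-in for the victim's real model; real inversion attacks train a decoder model instead, but the principle is the same — embeddings preserve enough signal to recover meaning.

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    # Toy stand-in for the victim's embedding model, which the
    # attacker is assumed to be able to query.
    vec = [0.0] * dim
    for w in text.lower().split():
        vec[int(hashlib.sha256(w.encode()).hexdigest(), 16) % dim] += 1.0
    n = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / n for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Both vectors are unit-length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

# Attacker's view: a leaked vector from the database, no plaintext.
leaked = embed("harry's owl was named hedwig")

# Inversion by search: guess candidate texts, keep the closest.
guesses = [
    "ron's rat was named scabbers",
    "harry's owl was named hedwig",
    "hermione's cat was named crookshanks",
]
best = max(guesses, key=lambda g: cosine(embed(g), leaked))
# `best` now recovers the stored sentence about Hedwig.
```

With a real embedding model the search space is far larger, which is why published attacks use trained inversion models rather than brute force, but the leak channel is identical.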
If you want to read the technical piece on how to do that, you can find it here.
Many vector databases are exposed to the internet and can be reverse engineered for PII / PHI and other confidential data.
Low Code
The second part of today’s story is about the use of low code. I was never a big fan of no-code or low-code platforms, but I understood that they bring software development to everyone. The challenge here is that developers on low-code or no-code platforms tend to put less thought into securing their code than a professional developer would have done. (Remember the time you forgot the curing process for your wall and had the tiles fall out?)
Low-code platforms, especially those offering simple AI integrations, seem to be leaking information as well. The vector databases behind these platforms are easily accessible and can be reverse engineered to obtain sensitive data.
Take Action:
Any organisation that develops AI-based products or services should identify, evaluate and manage these two risks. One way to do that is to enhance your attack surface monitoring to search for exposed vector databases.
In your list of software assets, add all the low-code and no-code platforms in use. Identify the risks from these platforms. Conduct a VA/PT of the applications built using these platforms, where feasible.
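To make the attack-surface-monitoring suggestion concrete, here is a hypothetical sketch that builds probe URLs for some popular vector databases so a scanner can flag anything answering unauthenticated. The ports and paths are the commonly documented defaults for each product (an assumption — verify against your own deployments and the vendors' current API docs), and this only generates the URLs; actually probing them should go through your authorised scanning tooling.

```python
# Commonly documented default ports and listing endpoints for a few
# vector database products (assumptions to verify, not guarantees).
DEFAULT_ENDPOINTS = {
    "qdrant":   (6333, "/collections"),
    "weaviate": (8080, "/v1/schema"),
    "chroma":   (8000, "/api/v1/collections"),
}

def probe_urls(hosts: list[str]) -> list[tuple[str, str]]:
    # Expand each host into (product, url) pairs for the scanner.
    urls = []
    for host in hosts:
        for product, (port, path) in DEFAULT_ENDPOINTS.items():
            urls.append((product, f"http://{host}:{port}{path}"))
    return urls
```

Feed the output to whatever your organisation already uses for external scanning; a 200 response with a collection listing and no authentication prompt is the red flag described in this story.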
CISA’s cyber incident reporting portal
A centralised reporting portal (for the US) for reporting a cyber incident goes live
The US’s cyber defence agency, the Cybersecurity and Infrastructure Security Agency (CISA), has launched an updated services portal where an organisation can voluntarily report a cyber breach.
This is a voluntary reporting platform, but CISA encourages organisations to use it.
Take Action:
If you are an organisation based in the US, consider reporting any incident on this portal. As per CISA, it will benefit you and help the broader community.