How to Govern Prompt, Context, and Fine‑Tuning Data in the “Acquire” Stage

In classic machine learning, “Acquire” is about training data. In the world of Generative AI, Acquire gets broader and more layered. You’re not just acquiring data once—you’re feeding prompts, retrieval contexts, and fine‑tuning corpora into systems that can remember, adapt, and generalise in ways that are hard to unwind.

If you don’t govern Acquire for GenAI, you can end up with IP leakage, privacy violations, and biased or unsafe behaviour baked into your models.

Broadening “Acquire” beyond training data

For GenAI systems, think of three input layers:

Base model training data Usually controlled by a vendor or foundation model provider. You generally can’t change this, but you can choose providers and understand high‑level training sources.
Fine‑tuning and instruction data Data you use to adapt a base model to your organisation’s needs (e.g., examples, instructions, domain‑specific texts).
Retrieval and context data (RAG) and prompts Documents and data retrieved at query time, plus the prompts users send and any “system prompts” you embed.

Your governance levers vary by layer. You can’t clean up the vendor’s base training set, but you can be very precise about what you add on top.

Set rules for fine‑tuning and instruction data

Fine‑tuning can be powerful and dangerous. It can encode your organisation’s style, knowledge, and behaviour deeply into a model.

You should define:

What is allowed: Organisation‑owned documentation, policies, manuals, FAQs, and other content that you have rights to and that doesn’t contain unnecessary personal data.
What must be excluded or handled with extreme care: Personal and sensitive data, third‑party IP without explicit rights for machine learning use, highly confidential or security‑sensitive material.

You also need a review process: who signs off on fine‑tuning datasets, and how do they check for hidden risks (e.g., casual inclusion of chat logs with personal details)?

Govern RAG context and retrieval sources

Retrieval‑augmented generation changes where knowledge lives: your model may stay mostly static, while the retrieval layer remains dynamic.

You should:

Curate knowledge sources Decide which repositories can be indexed (e.g., internal wikis, knowledge bases, policy libraries), and which must be excluded or segmented. Avoid indexing volatile, unvetted, or high‑risk sources by default.
Apply least‑privilege retrieval Use metadata and access controls so the retrieval layer only pulls documents appropriate for the user’s role and the use case. Prevent cross‑tenant or cross‑department leakage (e.g., one client’s documents showing up in another’s answers).

Governance here looks a lot like content and data governance—with the twist that retrieved content is being actively mixed into generated outputs.

Treat prompts and prompt libraries as governed assets

Prompts might look ephemeral, but they’re powerful design elements.

For shared or system prompts:

Define what they may contain No hard‑coded credentials, sensitive examples, or private context that might leak in outputs. Avoid embedding biased or inappropriate examples in few‑shot prompts.
Test them for safety and robustness Check how prompts behave under adversarial input. See whether they can be easily pushed into unsafe territory.

For prompt libraries:

Use version control and approvals for prompts used at scale (e.g., customer‑facing workflows).
Assign ownership: someone is responsible for maintaining and retiring prompts as policies and products evolve.

A simple “Acquire for GenAI” checklist

To operationalise this, create a short checklist that must be completed for every GenAI initiative:

What fine‑tuning or instruction data will be used? Who owns it?
What repositories will be indexed for retrieval? How are they classified?
How will prompts be designed, tested, and approved?
What categories of data are explicitly out of bounds across all layers?

Pilot this checklist with one project, refine it, and then standardise it. Done well, Acquire becomes one of your strongest levers for GenAI safety and compliance—without shutting down innovation.

If you’d like assistance or advice with your Data Governance implementation, or any other topic (Privacy, Cybersecurity, Ethics, AI and Product Management) please feel free to drop me an email here and I will endeavour to get back to you as soon as possible. Alternatively, you can reach out to me on LinkedIn and I will get back to you within the same day!