Data Governance
This page describes precisely what data MethodAtlas processes, what is and is not submitted to external AI providers, and how to configure the tool to meet specific data governance requirements.
Data processed locally (always)
The following operations are performed on the scan host and involve no external communication under any configuration:
| Operation | Data involved |
|---|---|
| Source file traversal | File paths and names within the scan root |
| Source file parsing (Java, C#, TypeScript) | Test source file content — in memory only |
| Method discovery | Parsed AST nodes — in memory only |
| Content hash computation | SHA-256 of the AST string — no content transmitted |
| CSV / SARIF / plain-text output | Result data written to stdout or a local file |
| Cache read and write | Local CSV file read and written on the scan host |
| Override file processing | Local YAML file read on the scan host |
| Delta report (-diff) | Two local CSV files compared in memory |
None of the above steps initiate any network connection.
Data submitted to external AI providers
When AI enrichment is enabled with -ai and a non-local provider is
configured, MethodAtlas submits one HTTPS request per test class to the
provider's inference API. Each request contains exactly:
- The taxonomy text: either the built-in taxonomy or the content of the file supplied with -ai-taxonomy. This is configuration data describing tag definitions; it contains no project-specific content.
- The list of test method names: the exact set of JUnit methods discovered by the parser in that class, with their source line numbers. This list is included to prevent the AI from inventing or omitting methods; only methods the parser found are classified.
- The test class source file: the full text of one test source file (Java, C#, or TypeScript) from the scan root, used as semantic context for classification. The file is truncated to the character limit set by -ai-max-class-chars (default: 40 000 characters) before transmission. The class name and all method names are always included; if the class body exceeds the limit, the trailing lines of the file are omitted.
In concrete terms, a single request to the AI provider contains:
- The class name (e.g. com.example.AuthServiceTest)
- All test method names found in that class (e.g. loginWithExpiredToken, loginWithValidCredentials)
- The full source text of that class file, up to -ai-max-class-chars characters
Nothing else from the project is included.
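To see in advance which files would be cut off by the character limit, a rough pre-check can count characters per test file. This is a minimal sketch and not part of MethodAtlas: the function name files_over_limit is hypothetical, the extension list (.java, .cs, .ts) is an assumption about which files the scanner picks up, and wc -m may count characters slightly differently from the tool itself.

```bash
# Hypothetical helper: list test source files longer than the truncation limit.
# Usage: files_over_limit <scan-root> [limit]   (limit defaults to 40000)
# Assumption: file names do not contain newlines.
files_over_limit() {
  local root limit file chars
  root=$1
  limit=${2:-40000}
  find "$root" -type f \( -name '*.java' -o -name '*.cs' -o -name '*.ts' \) |
    while IFS= read -r file; do
      # wc -m counts characters; the arithmetic expansion strips any padding
      chars=$(( $(wc -m < "$file") ))
      if [ "$chars" -gt "$limit" ]; then
        printf '%s\t%d\n' "$file" "$chars"
      fi
    done
}

# Example: files_over_limit src/test/java 40000
```

Files reported here are candidates for splitting, or for a higher -ai-max-class-chars value, if the trailing test methods need full source context.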
What is never submitted:
| Data category | Included in AI request |
|---|---|
| Production source code | No |
| Build scripts (pom.xml, build.gradle) | No |
| Configuration files (.properties, .yaml, environment files) | No |
| Credentials, secrets, or API keys | No |
| Database schemas or migration scripts | No |
| Infrastructure definitions (Terraform, Kubernetes manifests) | No |
| Other test files submitted as context | No — each class is submitted independently |
| File paths or directory structure | No — only the class source text and method names are included; the absolute path on disk is not transmitted |
The AI provider receives the text of one test class (Java, C#, or TypeScript) at a time. No information about the surrounding project structure, the production implementation, or any other file is included.
Provider data processing policies
Each provider processes submitted data according to its own terms of service and data processing agreement. The following links point to the relevant policy documents at the time of writing; verify the current version before approving a provider for use:
| Provider | Data processing policy |
|---|---|
| OpenAI | openai.com/policies/api-data-usage-policies |
| Anthropic (Claude) | anthropic.com/legal/privacy |
| Azure OpenAI | learn.microsoft.com — Azure OpenAI data privacy |
| Mistral AI | mistral.ai/terms |
| Groq | groq.com/privacy-policy |
| xAI | x.ai/legal/privacy-policy |
| GitHub Models | docs.github.com — GitHub Models usage |
| Ollama (local) | No data leaves the host |
Policy changes
Provider data processing policies change over time. Review the linked documents and consult your legal or privacy team before approving a provider for scanning test code that may contain project-specific logic.
GDPR considerations
Test source files typically do not contain personal data. However, test files that use hard-coded real names, email addresses, or other personal data as test fixtures do contain personal data and are subject to GDPR data minimisation and transfer requirements when submitted to a hosted AI provider.
Recommended practices:
- Use randomised or obviously synthetic test data in test fixtures (e.g. alice@example.com, 1970-01-01) rather than real personal data.
- If test files contain unavoidable personal data and must be scanned, use either the Ollama local provider (no data leaves the host) or the Manual AI workflow (no outbound connections from the scan host).
- To avoid EU-to-US transfers, use Azure OpenAI with a European region endpoint, which processes data within the EU, or Mistral AI, which is operated from the European Union. Both are suitable for organisations with EU data residency requirements.
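Before enabling a hosted provider, test fixtures can be screened for email addresses that do not look synthetic. The following is a minimal sketch under stated assumptions: the function name find_suspect_emails is hypothetical, the regular expression is deliberately loose, and only the example.com, example.org, and example.net domains (reserved by RFC 2606) are treated as safe. It is a first-pass filter, not a substitute for a proper personal-data review.

```bash
# Hypothetical helper: report email-like strings in test sources that are
# NOT on a reserved example domain. Output format: file:line:address
find_suspect_emails() {
  local root
  root=$1
  grep -rEno \
    --include='*.java' --include='*.cs' --include='*.ts' \
    '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' "$root" |
    # Drop addresses on example.com / example.org / example.net and subdomains
    grep -Eiv '@([A-Za-z0-9-]+\.)*example\.(com|org|net)$' || true
}

# Example: find_suspect_emails src/test/java
```

An empty result does not prove the absence of personal data; names, phone numbers, and identifiers need their own checks.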
Data residency
MethodAtlas supports three data-residency tiers. Choose the tier that matches your organisation's data governance requirements.
Tier 1: data never leaves the machine
Configure -ai-provider ollama with -ai-base-url pointing to an Ollama
server on the local host or internal network. All AI inference runs on your
infrastructure. No API key is required. No outbound connection is made.
This is the appropriate choice for: air-gapped environments, strict DLP policies, or any scenario where no test source code may leave the organisation's network.
See Air-Gapped and Offline Deployment for setup instructions.
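A Tier 1 invocation might look like the sketch below. The flags are the ones named on this page; everything else is an assumption: the wrapper name scan_with_local_ollama and the METHODATLAS and OLLAMA_URL variables are hypothetical, ./methodatlas and src/test/java are borrowed from the Vault example further down, and http://localhost:11434 is simply Ollama's default listen address. Check the AI Providers page for the exact base URL form MethodAtlas expects.

```bash
# Hypothetical wrapper: run a scan against a local Ollama server.
# METHODATLAS and OLLAMA_URL can be overridden from the environment.
scan_with_local_ollama() {
  "${METHODATLAS:-./methodatlas}" \
    -ai \
    -ai-provider ollama \
    -ai-base-url "${OLLAMA_URL:-http://localhost:11434}" \
    "$@"
}

# Example: scan_with_local_ollama src/test/java
```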
Tier 2: operator controls what leaves the network (manual workflow)
Use -manual-prepare to produce prompt files on the scan host (no network required),
then carry those files to an authorised workstation with internet access, interact
with the AI chat interface there, and return the responses to the scan host for
-manual-consume. The operator decides exactly which prompts — and therefore which
class sources — are submitted to the AI provider, and when.
This is the appropriate choice for: regulated environments with supervised AI access, teams that require human sign-off before any data leaves the network, or pipelines where the scan host has no internet connectivity at all.
See Manual AI Workflow for the complete procedure.
Tier 3: data is processed by a cloud AI provider
When a hosted provider is configured, class source files are transmitted to that provider's inference API over HTTPS. The provider's data processing policies govern what is retained and for how long.
| Provider | Data residency |
|---|---|
| ollama | Fully local; data never leaves the host |
| azure_openai | Customer's chosen Azure region; EU regions available |
| mistral | European Union |
| openai | OpenAI infrastructure (US) |
| github_models | Microsoft Azure infrastructure |
| groq | Groq infrastructure (US) |
| xai | xAI infrastructure (US) |
For organisations with strict EU data residency requirements, ollama,
azure_openai (with an EU region endpoint), and mistral are the appropriate
choices.
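In a CI pipeline, this choice can be enforced with a small guard that runs before the scan. The sketch below assumes that the provider identifiers in the table above are the values passed to -ai-provider; the function names eu_residency_ok and require_eu_provider are hypothetical.

```bash
# Hypothetical guard: succeed only for providers listed as EU-suitable above.
eu_residency_ok() {
  case "$1" in
    ollama|mistral) return 0 ;;
    # Only valid when the configured endpoint is in an EU Azure region;
    # this function cannot verify the region itself.
    azure_openai) return 0 ;;
    *) return 1 ;;
  esac
}

# Print a diagnostic and return non-zero for any other provider.
require_eu_provider() {
  if ! eu_residency_ok "$1"; then
    echo "provider '$1' is not approved for EU data residency" >&2
    return 1
  fi
}

# Example (in a pipeline step): require_eu_provider "$AI_PROVIDER" || exit 1
```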
DLP-compatible deployment
Data Loss Prevention (DLP) controls that block or inspect outbound traffic are fully compatible with MethodAtlas. Use Tier 1 (Ollama) or Tier 2 (manual workflow) from the data residency options above — neither configuration initiates any outbound connection from the scan host.
For Tier 3 (cloud providers), AI calls can be routed through an internal proxy or HTTPS inspection point; configure the proxy via standard environment variables (HTTPS_PROXY, NO_PROXY) before invoking MethodAtlas.
See Air-Gapped and Offline Deployment for complete implementation guidance.
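For Tier 3, the proxy variables can be scoped to the single scan command rather than exported for the whole shell session. This is a sketch under assumptions: the wrapper name scan_via_proxy and the METHODATLAS variable are hypothetical, the proxy address in the usage line is a placeholder, and MY_API_KEY is assumed to be set already as described under enterprise secret management.

```bash
# Hypothetical wrapper: run one scan with proxy settings applied to that
# command only. Usage: scan_via_proxy <proxy-url> [no-proxy-list]
scan_via_proxy() {
  local proxy no_proxy
  proxy=$1
  no_proxy=${2:-localhost,127.0.0.1}
  # Assignments placed before a command are passed to that command only
  HTTPS_PROXY="$proxy" NO_PROXY="$no_proxy" \
    "${METHODATLAS:-./methodatlas}" \
    -ai -ai-provider openai -ai-api-key-env MY_API_KEY src/test/java
}

# Example: scan_via_proxy http://proxy.internal.example:3128
```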
Auditing outbound AI calls
In environments that log or inspect outbound HTTPS traffic, MethodAtlas AI calls can be identified by the following characteristics:
| Characteristic | Value |
|---|---|
| Destination | The provider's API base URL (see AI Providers) |
| HTTP method | POST |
| Request frequency | One request per test class; multiple classes may be processed in rapid succession |
| Request size | Proportional to the test class source length; bounded by -ai-max-class-chars (default 40 000 characters) |
| Content type | application/json |
Enable logging of request URLs and sizes at the network proxy or firewall level to produce an audit trail of what was submitted and when.
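Once such a log exists, the characteristics in the table can be used to filter it. The sketch below assumes a purely hypothetical log format of four whitespace-separated fields (timestamp, HTTP method, URL, request size in bytes) and a proxy that performs HTTPS inspection, since full request URLs are not visible otherwise. The function name audit_ai_calls is also hypothetical; adapt the field positions to your proxy's real format.

```bash
# Hypothetical helper: list POST requests to one provider host and total them.
# Usage: audit_ai_calls <logfile> <provider-host>
# Assumed log line: "<timestamp> <method> <url> <request-bytes>"
audit_ai_calls() {
  awk -v host="$2" '
    # Keep POST requests whose URL starts with https://<host>/
    $2 == "POST" && index($3, "https://" host "/") == 1 {
      n++
      bytes += $4
      print $1, $3, $4
    }
    END { printf "requests=%d bytes=%d\n", n + 0, bytes + 0 }
  ' "$1"
}

# Example: audit_ai_calls /var/log/proxy/access.log api.openai.com
```

The request count should match the number of test classes scanned, which gives a simple consistency check against the MethodAtlas output.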
Enterprise secret management
The -ai-api-key-env <name> flag reads the API key from a named environment variable. This is the recommended approach for CI pipelines (where the secret is stored in the CI secret store and injected as an environment variable at runtime), but it may not satisfy security policies in environments where environment variables are prohibited as a secret delivery mechanism (e.g. some PCI-DSS or CyberArk-governed workloads).
For deployments with stricter secret management requirements, the following patterns are available:
HashiCorp Vault / AWS Secrets Manager / Azure Key Vault: retrieve the API key before invoking MethodAtlas and pass it via the environment variable pattern:
```bash
# Vault example (adjust for your auth method)
export MY_API_KEY="$(vault kv get -field=api_key secret/methodatlas/openai)"
./methodatlas -ai -ai-provider openai -ai-api-key-env MY_API_KEY src/test/java
# Remove the key from the shell environment once the scan has finished
unset MY_API_KEY
```
The unset immediately after the run limits the variable's lifetime to the single invocation. In containerised CI environments the variable is scoped to the container process and is not visible outside it.
File-based secret delivery (CyberArk Conjur, Kubernetes Secrets mounted as files): read the key from the file and export it to the environment immediately before the MethodAtlas call. Avoid writing the key to any persistent storage.
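One way to do this without ever exporting the key into the parent shell is to set the variable for the scan command only. This is a minimal sketch: the function name run_with_key_file and the secret path in the usage line are hypothetical, and MY_API_KEY is the variable name used in the Vault example.

```bash
# Hypothetical helper: read the key from a mounted secret file and expose it
# only to the command that follows. Usage: run_with_key_file <file> <cmd...>
run_with_key_file() {
  local key_file
  key_file=$1
  shift
  # The assignment prefix places MY_API_KEY in the child's environment only;
  # command substitution strips the file's trailing newline.
  MY_API_KEY=$(cat "$key_file") "$@"
}

# Example:
# run_with_key_file /run/secrets/methodatlas-api-key \
#   ./methodatlas -ai -ai-provider openai -ai-api-key-env MY_API_KEY src/test/java
```

Because the variable is never exported in the calling shell, no unset step is needed afterwards.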
Ollama / Manual workflow (no API key required): for the strictest zero-secret requirement, use local Ollama inference or the Manual AI Workflow. Neither approach requires an API key on the scan host.
Further reading
- AI Providers — provider configuration and base URLs
- Air-Gapped and Offline Deployment — zero-egress deployment
- Manual AI Workflow — classification without network access from scan host
- Compliance & Standards — framework-specific evidence requirements