This is a WIP community resource, containing contributions from community members.
Goal: Learn how to integrate a locally-running LLM into a Java project using llama.cpp as an inference backend, understand the agentic workflow architecture, and implement AI-powered natural language input for your app.
Local vs. hosted AI: Unlike the hosted AI approach which calls an external API, this tutorial runs the model entirely on the user's machine. There is no API key to manage and the app works fully offline — at the cost of requiring llama-server to be installed on the target machine.
This approach runs inference via llama.cpp's llama-server as an external subprocess. Your app spawns the server on startup, then communicates with it over HTTP on localhost using Java's built-in HttpClient — no extra AI library dependencies needed.
User input
│
▼
Parser.parse() ──── success ──▶ execute command directly
│ unknown command
▼
LlmInterpreter.interpret()
│ builds prompt from template
▼
LlmServer.complete() ──HTTP POST /completion──▶ llama-server subprocess
│ │
│ ◀──────────────── raw text ──────────────────────────
▼
validateAndNormalise()
│ strips preamble, checks known commands
▼
executeInterpreted() ──▶ Parser.parse() ──▶ execute
The key idea is that the LLM is only invoked as a fallback when the traditional parser cannot understand the input. This keeps exact commands fast and predictable, while natural language input is handled by the model.
Before you begin, make sure the following are in place:
- Java 11 or later (check with java -version).
- The llama-server binary must be on your PATH or in a standard Homebrew location.
- A GGUF model file in src/main/resources/models/. A good starting point is Qwen3.5-0.8B-Q4_K_M.gguf, which is fast on CPU. To download it:

pip install huggingface_hub
huggingface-cli download Qwen/Qwen3.5-0.8B --local-dir ./src/main/resources/models
No Ollama, no API key, no extra Gradle dependencies. llama-server is a standalone binary and Java's java.net.http.HttpClient (available since Java 11) handles the HTTP calls. The model file can be bundled into your fat JAR as a classpath resource so end-users do not need to download it separately.
Create a class (e.g. LlmServer) to manage the entire lifecycle of the llama-server process.
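The later snippets refer to several constants and fields on this class (SERVER_PORT, MODEL_RESOURCE, CACHE_DIR, and so on). Their exact names and values are up to you; the skeleton below is only illustrative:

public class LlmServer {
    private static final int SERVER_PORT = 8088;              // any free local port
    private static final String MODEL_RESOURCE = "/models/Qwen3.5-0.8B-Q4_K_M.gguf";
    private static final File CACHE_DIR =
            new File(System.getProperty("user.home"), ".myapp");
    private static final String CACHED_MODEL_NAME = "model.gguf";
    private static final int REQUEST_TIMEOUT_SECONDS = 30;

    private final HttpClient httpClient = HttpClient.newHttpClient();
    private Process serverProcess;

    // start(), complete() and the helper methods below go here
}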
Probe a list of candidate paths in priority order so the app works across platforms without manual configuration:
private static final String[] CANDIDATE_PATHS = {
"llama-server", // on PATH
"/opt/homebrew/bin/llama-server", // Homebrew Apple Silicon
"/usr/local/bin/llama-server", // Homebrew Intel
"/usr/bin/llama-server"
};
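A small helper can then walk this list and return the first binary that works. The findBinary() method below is one possible sketch (the name is illustrative): absolute paths are checked directly, and the bare name is resolved by attempting to launch it, assuming the binary accepts a --version flag.

private static String findBinary() {
    for (String candidate : CANDIDATE_PATHS) {
        if (candidate.contains("/")) {
            // Absolute path: just check that the file exists and is executable
            if (new File(candidate).canExecute()) {
                return candidate;
            }
            continue;
        }
        try {
            // Bare name: resolve via PATH by attempting to launch it
            new ProcessBuilder(candidate, "--version").start().waitFor();
            return candidate;
        } catch (IOException | InterruptedException e) {
            // not on PATH; keep probing
        }
    }
    return null; // not found: the app falls back to exact commands only
}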
When running from an exploded build the model is used directly from the classpath. Inside a fat JAR it is extracted to a local cache directory on first run:
private File resolveModelFile() throws IOException {
    URL resourceUrl = LlmServer.class.getResource(MODEL_RESOURCE);
    if (resourceUrl == null) {
        throw new FileNotFoundException("Model resource not found: " + MODEL_RESOURCE);
    }
    if ("file".equals(resourceUrl.getProtocol())) {
        try {
            return new File(resourceUrl.toURI()); // dev mode: use the classpath file directly
        } catch (URISyntaxException e) {
            throw new IOException("Bad model resource URL", e);
        }
    }
    // Fat JAR: extract the model to a local cache directory on first run
    File dest = new File(CACHE_DIR, CACHED_MODEL_NAME);
    if (!dest.exists()) {
        CACHE_DIR.mkdirs();
        try (InputStream in = LlmServer.class.getResourceAsStream(MODEL_RESOURCE)) {
            Files.copy(in, dest.toPath(), StandardCopyOption.REPLACE_EXISTING);
        }
    }
    return dest;
}
Launch the binary with ProcessBuilder, pointing it at the resolved model file and a fixed local port:

ProcessBuilder pb = new ProcessBuilder(
        binary,
        "--model", modelFile.getAbsolutePath(),
        "--port", String.valueOf(SERVER_PORT),
        "--ctx-size", "2048",
        "--n-gpu-layers", "0", // set > 0 to use GPU
        "--threads", String.valueOf(Math.max(1, Runtime.getRuntime().availableProcessors() / 2)),
        "--no-webui",
        "--log-disable"
);
pb.redirectErrorStream(true);
pb.redirectOutput(ProcessBuilder.Redirect.DISCARD); // the server's log output is not needed
serverProcess = pb.start();
After starting, poll GET /health every 500 ms (up to 60 s) until the server responds with HTTP 200 before marking it ready.
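For illustration, a readiness loop along those lines might look like this (waitUntilReady() is a hypothetical helper; it reuses the httpClient field and SERVER_PORT constant shown earlier):

private boolean waitUntilReady() throws InterruptedException {
    HttpRequest healthCheck = HttpRequest.newBuilder()
            .uri(URI.create("http://127.0.0.1:" + SERVER_PORT + "/health"))
            .timeout(Duration.ofSeconds(2))
            .GET()
            .build();
    long deadline = System.currentTimeMillis() + 60_000; // give up after 60 s
    while (System.currentTimeMillis() < deadline) {
        try {
            HttpResponse<Void> response =
                    httpClient.send(healthCheck, HttpResponse.BodyHandlers.discarding());
            if (response.statusCode() == 200) {
                return true; // model loaded, server ready for /completion requests
            }
        } catch (IOException e) {
            // server not accepting connections yet
        }
        Thread.sleep(500);
    }
    return false; // timed out; treat the LLM as unavailable
}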
Add a complete() method that POSTs a JSON body to the /completion endpoint and extracts the "content" field from the response:
public String complete(String prompt, int maxTokens) {
    // Escape characters that would break the hand-built JSON string
    String escapedPrompt = prompt.replace("\\", "\\\\")
            .replace("\"", "\\\"")
            .replace("\n", "\\n");
    String body = "{"
            + "\"prompt\":\"" + escapedPrompt + "\","
            + "\"n_predict\":" + maxTokens + ","
            + "\"temperature\":0.0,"
            + "\"stop\":[\"\\n\",\"<|im_end|>\",\"User:\"],"
            + "\"stream\":false"
            + "}";
    HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://127.0.0.1:" + SERVER_PORT + "/completion"))
            .header("Content-Type", "application/json")
            .timeout(Duration.ofSeconds(REQUEST_TIMEOUT_SECONDS))
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
    try {
        HttpResponse<String> response =
                httpClient.send(request, HttpResponse.BodyHandlers.ofString());
        return parseContent(response.body()); // extracts the "content" field via regex
    } catch (IOException | InterruptedException e) {
        return null; // treated by the caller as "LLM unavailable"
    }
}
Why temperature=0.0 and stop tokens? We want deterministic, single-line output. Setting temperature to 0.0 makes the model always pick the most likely token. The stop tokens (\n, <|im_end|>) cut off generation immediately after the first line, preventing the model from adding unwanted explanation.
Rather than hardcoding the prompt in Java, store it in a resource file (e.g. src/main/resources/nlp/prompt_template.txt) and load it once at class-init time:
private static final String PROMPT_TEMPLATE = loadPromptTemplate();

private static String loadPromptTemplate() {
    try (InputStream is = LlmInterpreter.class.getResourceAsStream("/nlp/prompt_template.txt")) {
        return new String(is.readAllBytes(), StandardCharsets.UTF_8);
    } catch (IOException e) {
        throw new UncheckedIOException("Cannot load prompt template", e);
    }
}
The template uses ChatML format (required by Qwen and many other open models) and instructs the model to output only a valid command string. Suppose your app is a CLI-centric task management app — the template would look like:
<|im_start|>system
You are a task manager assistant. Convert user input into ONE valid app command string.
COMMANDS:
- todo <description>
- deadline <description> /by <date>
- event <description> /from <date> /to <date>
- list
- done <task_number>
- delete <task_number>
- help
RULES:
1. Output ONLY the command string, no explanation.
2. For greetings or chitchat, output: CHAT: <short friendly reply>
3. If no command fits, output: UNKNOWN
EXAMPLES:
remind me to buy groceries -> todo buy groceries
remove task 3 -> delete 3
what tasks do I have? -> list
hello -> CHAT: Hey there! Ready to tackle your to-do list?
<|im_end|>
<|im_start|>user
{USER_INPUT}
<|im_end|>
<|im_start|>assistant
At runtime, replace {USER_INPUT} with the actual user input before sending the prompt. Keeping the prompt in a text file (rather than Java source) makes it easy to iterate on without recompiling.
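For example, the substitution can be a one-liner (buildPrompt is just an illustrative helper name):

private String buildPrompt(String userInput) {
    // PROMPT_TEMPLATE is the template loaded above; the placeholder is replaced verbatim
    return PROMPT_TEMPLATE.replace("{USER_INPUT}", userInput.trim());
}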
Small local models sometimes prepend preamble text (e.g., "Sure! Here is the command: todo buy milk"). Add a validateAndNormalise() method to handle this with a two-phase check:
String validateAndNormalise(String raw) {
String cleaned = raw.replaceAll("^['\"`]+|['\"`]+$", "").trim();
// Pass-through for chat responses
if (cleaned.regionMatches(true, 0, "CHAT:", 0, 5)) {
return cleaned;
}
// Phase 1: first word is a known command — accept directly
String firstWord = cleaned.split("\\s+")[0].toLowerCase();
if (KNOWN_COMMANDS.contains(firstWord)) {
return cleaned;
}
// Phase 2: scan for a command word after preamble
Matcher matcher = COMMAND_SCAN_PATTERN.matcher(cleaned);
if (matcher.find()) {
return cleaned.substring(matcher.start(1)).trim();
}
return null; // unrecognised — caller shows fallback message
}
KNOWN_COMMANDS should mirror every command keyword your parser recognises (plus any short aliases), so the whitelist stays in sync with the rest of the app.
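For illustration, with the task manager commands from the template above, the two constants used in validateAndNormalise() could be defined like this (adapt the keyword list to your own parser):

// Illustrative whitelist; keep it in sync with the commands your Parser accepts
private static final Set<String> KNOWN_COMMANDS = Set.of(
        "todo", "deadline", "event", "list", "done", "delete", "help");

// Matches the first known command word, so preamble such as
// "Sure! Here is the command: todo buy milk" can be stripped
private static final Pattern COMMAND_SCAN_PATTERN = Pattern.compile(
        "\\b(todo|deadline|event|list|done|delete|help)\\b", Pattern.CASE_INSENSITIVE);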
Starting llama-server can take a few seconds. Initialise it on a background thread so your app remains responsive:
Thread llmInit = new Thread(() -> {
LlmInterpreter.getInstance().init(); // starts llama-server, blocks until ready
boolean ready = LlmInterpreter.getInstance().isReady();
// notify the UI or log the status
System.out.println(ready
? "Natural language mode ready."
: "LLM unavailable — exact commands still work.");
}, "llm-init");
llmInit.setDaemon(true);
llmInit.start();
For JavaFX apps, wrap the call in a Task<Void> and update the UI in setOnSucceeded to stay on the JavaFX thread.
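A sketch of that JavaFX variant, assuming a statusLabel control somewhere in your UI:

Task<Void> llmInitTask = new Task<>() {
    @Override
    protected Void call() {
        LlmInterpreter.getInstance().init(); // heavy work off the JavaFX thread
        return null;
    }
};
llmInitTask.setOnSucceeded(e -> {
    // Runs on the JavaFX Application Thread, so it is safe to touch the UI here
    boolean ready = LlmInterpreter.getInstance().isReady();
    statusLabel.setText(ready
            ? "Natural language mode ready."
            : "LLM unavailable; exact commands still work.");
});
Thread llmInitThread = new Thread(llmInitTask, "llm-init");
llmInitThread.setDaemon(true);
llmInitThread.start();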
When the user submits input, try the traditional parser first. Only fall back to the LLM if the command is not recognised:
// Try exact command parsing first — no LLM involvement
try {
String response = app.handleCommand(input);
display(response);
return;
} catch (UnknownCommandException ignored) {
// fall through to LLM
}
// LLM fallback — run off the main thread to avoid blocking
new Thread(() -> {
String interpreted = LlmInterpreter.getInstance().interpret(input);
handleLlmResult(interpreted);
}, "llm-interpret").start();
For commands that modify or delete data (e.g. delete, clear), consider asking the user to confirm before executing the AI-generated command:
if (isDestructive(candidate)) {
pendingCommand = candidate;
display("I interpreted that as: \"" + candidate + "\"\n"
+ "Type yes to confirm, or anything else to cancel.");
}
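On the next input, the stored command is either executed or discarded. A minimal sketch, reusing the executeInterpreted() step from the architecture diagram:

// At the start of input handling: resolve a pending confirmation before anything else
if (pendingCommand != null) {
    String confirmed = pendingCommand;
    pendingCommand = null;
    if (input.trim().equalsIgnoreCase("yes")) {
        executeInterpreted(confirmed); // runs the normal parse-and-execute path
    } else {
        display("Okay, cancelled.");
    }
    return;
}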
Place the model file in src/main/resources/models/ so Gradle includes it in the fat JAR automatically. Use the Shadow plugin:
plugins {
id 'application'
id 'com.github.johnrengelman.shadow' version '7.1.2'
}
shadowJar {
archiveBaseName = "myapp"
archiveClassifier = null
mergeServiceFiles()
}
Build it:
./gradlew shadowJar
The resulting JAR is self-contained. On first run, LlmServer extracts the model to a local cache directory automatically — no manual setup required by the user beyond having llama-server installed.
llama-server itself must still be installed on the target machine. Only the model weights are bundled in the JAR; the inference binary is a system dependency.
With the basic local AI integration in place, you can expand the capabilities further along the following directions:
- Check LlmInterpreter.isReady() at startup. If llama-server is missing or the model fails to load, the app should still work with exact commands — consider showing a visible status indicator in the UI.
- Document the llama-server requirement clearly, and detect and report it gracefully in-app if it is missing.
- Expand the EXAMPLES section of your template to cover edge cases specific to your command set.
- Set --n-gpu-layers to a positive number when launching llama-server to offload layers to the GPU for faster inference on supported hardware.

Contributors: