Running Local LLMs with Ollama and Node.js

Run LLMs locally without API costs using Ollama:

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull a model
ollama pull llama2
ollama pull codellama

Node.js Integration

npm install ollama

import { Ollama } from 'ollama';

const ollama = new Ollama();

// Simple completion
const response = await ollama.chat({
  model: 'llama2',
  messages: [{ role: 'user', content: 'Explain closures in JS' }]
});

console.log(response.message.content);

// Streaming
const stream = await ollama.chat({
  model: 'codellama',
  messages: [{ role: 'user', content: 'Write a fibonacci function' }],
  stream: true
});

for await (const chunk of stream) {
  process.stdout.write(chunk.message.content);
}
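
By default the client talks to the local Ollama server on port 11434; a different machine can be targeted through the constructor. A minimal sketch (the address below is hypothetical):

// Point the client at an Ollama server on another machine (hypothetical address)
const remote = new Ollama({ host: 'http://192.168.0.10:11434' });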

Memory Requirements

  • 7B models: 8GB RAM
  • 13B models: 16GB RAM
  • 70B models: 64GB+ RAM

Rate Limiting AI API Calls in Node.js with Bottleneck

Rate limiting is critical for AI APIs. Here’s a robust implementation:

import Bottleneck from 'bottleneck';
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const limiter = new Bottleneck({
  reservoir: 60,           // 60 requests
  reservoirRefreshAmount: 60,
  reservoirRefreshInterval: 60 * 1000, // per minute
  maxConcurrent: 5,
  minTime: 100             // 100ms between requests
});

// Wrap OpenAI calls
const rateLimitedChat = limiter.wrap(async (prompt) => {
  return openai.chat.completions.create({
    model: 'gpt-4',
    messages: [{ role: 'user', content: prompt }]
  });
});

// Use with automatic queuing
const results = await Promise.all(
  prompts.map(p => rateLimitedChat(p))
);

Exponential Backoff

async function withRetry(fn, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (e) {
      if (e.status === 429 && i < maxRetries - 1) {
        await new Promise(r => setTimeout(r, Math.pow(2, i) * 1000));
      } else throw e;
    }
  }
}
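
The two helpers compose: wrap each rate-limited call in withRetry so any 429 that still slips through backs off exponentially. A minimal usage sketch, reusing rateLimitedChat and withRetry from above:

// Queue every prompt through the limiter and retry on 429
const answers = await Promise.all(
  prompts.map(p => withRetry(() => rateLimitedChat(p)))
);

console.log(answers[0].choices[0].message.content);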

Building Autonomous AI Agents with LangChain.js

LangChain agents can use tools autonomously. Here’s a complete agent setup:

import { ChatOpenAI } from '@langchain/openai';
import { AgentExecutor, createOpenAIToolsAgent } from 'langchain/agents';
import { DynamicTool } from '@langchain/core/tools';
import { ChatPromptTemplate } from '@langchain/core/prompts';

const tools = [
  new DynamicTool({
    name: 'calculator',
    description: 'Performs math calculations',
    func: async (input) => {
      return String(eval(input)); // Use mathjs in production
    }
  }),
  new DynamicTool({
    name: 'search',
    description: 'Search the web',
    func: async (query) => {
      // Your search API here
      return `Results for: ${query}`;
    }
  })
];

const llm = new ChatOpenAI({ model: 'gpt-4' });
const prompt = ChatPromptTemplate.fromMessages([
  ['system', 'You are a helpful assistant with access to tools.'],
  ['human', '{input}'],
  ['placeholder', '{agent_scratchpad}']
]);

const agent = await createOpenAIToolsAgent({ llm, tools, prompt });
const executor = new AgentExecutor({ agent, tools });

const result = await executor.invoke({
  input: 'What is 25 * 48 and search for Node.js news'
});
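
In current LangChain.js versions the executor result exposes the final answer on its output property:

console.log(result.output); // final answer, produced after any tool calls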

Implementing LLM Response Caching with Redis

Caching LLM responses saves money and improves latency:

import { createHash } from 'crypto';
import Redis from 'ioredis';
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const redis = new Redis();
const CACHE_TTL = 3600; // 1 hour

function hashPrompt(messages, model) {
  const content = JSON.stringify({ messages, model });
  return createHash('sha256').update(content).digest('hex');
}

async function cachedChat(messages, options = {}) {
  const { model = 'gpt-4', bypassCache = false } = options;
  const cacheKey = `llm:${hashPrompt(messages, model)}`;

  if (!bypassCache) {
    const cached = await redis.get(cacheKey);
    if (cached) {
      console.log('Cache HIT');
      return JSON.parse(cached);
    }
  }

  console.log('Cache MISS');
  const response = await openai.chat.completions.create({
    model,
    messages
  });

  await redis.setex(cacheKey, CACHE_TTL, JSON.stringify(response));
  return response;
}

Semantic Caching

For similar (not exact) queries, use embedding similarity with a threshold.
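
One way to sketch that, reusing the openai client from above and keeping embeddings in a plain in-memory array (a vector database, or Redis with a vector index, would replace it in practice); the embedding model and threshold are illustrative:

const SIMILARITY_THRESHOLD = 0.95; // illustrative; tune per use case
const semanticCache = [];          // entries of { embedding, response }

function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function embed(text) {
  const res = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text
  });
  return res.data[0].embedding;
}

async function semanticCachedChat(prompt) {
  const queryEmbedding = await embed(prompt);

  // Return the closest cached response above the threshold, if any
  for (const entry of semanticCache) {
    if (cosineSimilarity(queryEmbedding, entry.embedding) >= SIMILARITY_THRESHOLD) {
      return entry.response;
    }
  }

  const response = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [{ role: 'user', content: prompt }]
  });

  semanticCache.push({ embedding: queryEmbedding, response });
  return response;
}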

Text Chunking Strategies for RAG Applications

Chunking strategies greatly affect RAG quality:

import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';

// Basic chunking
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
  separators: ['\n\n', '\n', ' ', '']
});

const chunks = await splitter.splitText(document);

Semantic Chunking

async function semanticChunk(text, maxTokens = 500) {
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
  const chunks = [];
  let current = [];
  let tokenCount = 0;

  for (const sentence of sentences) {
    const tokens = sentence.split(/\s+/).length; // Rough estimate: word count as token count
    if (tokenCount + tokens > maxTokens && current.length) {
      chunks.push(current.join(' '));
      current = [];
      tokenCount = 0;
    }
    current.push(sentence);
    tokenCount += tokens;
  }
  if (current.length) chunks.push(current.join(' '));
  return chunks;
}

Best Practices

  • Chunk size: 500-1000 tokens
  • Overlap: 10-20% for context
  • Preserve semantic boundaries

Building Conversational AI with Context Memory in Node.js

Building a conversational AI with memory requires careful context management:

class ConversationManager {
  constructor(options = {}) {
    this.maxTokens = options.maxTokens || 4000;
    this.systemPrompt = options.systemPrompt || 'You are a helpful assistant.';
    this.conversations = new Map();
  }

  getHistory(sessionId) {
    if (!this.conversations.has(sessionId)) {
      this.conversations.set(sessionId, []);
    }
    return this.conversations.get(sessionId);
  }

  async chat(sessionId, userMessage) {
    const history = this.getHistory(sessionId);
    history.push({ role: 'user', content: userMessage });

    // Trim history if too long
    while (this.estimateTokens(history) > this.maxTokens) {
      history.shift();
    }

    const response = await openai.chat.completions.create({
      model: 'gpt-4',
      messages: [
        { role: 'system', content: this.systemPrompt },
        ...history
      ]
    });

    const reply = response.choices[0].message.content;
    history.push({ role: 'assistant', content: reply });
    return reply;
  }

  estimateTokens(messages) {
    // Rough heuristic: ~4 characters per token
    return messages.reduce((sum, m) => sum + m.content.length / 4, 0);
  }
}
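
A quick usage sketch, assuming the same OpenAI client (openai) used in the earlier examples; the session ID is just an arbitrary string, for example one per user or per socket:

const manager = new ConversationManager({
  systemPrompt: 'You are a concise coding assistant.'
});

// The same sessionId keeps context across turns
console.log(await manager.chat('user-42', 'What is a closure in JavaScript?'));
console.log(await manager.chat('user-42', 'Show a short example of one.'));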

Async waterfall example in Node.js

To avoid callback hell, async.waterfall is a very useful tool for structuring calls in a sequence and making sure each step passes its data on to the next.

var async = require('async');

async.waterfall(
    [
        function(callback) {
            callback(null, 'Yes', 'it');
        },
        function(arg1, arg2, callback) {
            var caption = arg1 +' and '+ arg2;
            callback(null, caption);
        },
        function(caption, callback) {
            caption += ' works!';
            callback(null, caption);
        }
    ],
    function (err, caption) {
        console.log(caption);
        // Yes and it works!
    }
);

nJoy 😉

Node.js Start script or run command and handoff to OS (no waiting)

Sometimes you want to run something in the OS from your Node.js code, but you do not want to follow it or have a callback posted on your stack. By default, child_process spawning holds on to a handle and parks a callback on the stack; however, there is an option called detached.

var spawn = require('child_process').spawn;
spawn('/usr/scripts/script.sh', ['param1'], {
    detached: true
});

You can even set up the call to pipe stdout and stderr to OS file descriptors, like this:

var fs = require('fs'),
    spawn = require('child_process').spawn,
    out = fs.openSync('./out.log', 'a'),
    err = fs.openSync('./err.log', 'a');

spawn('/usr/scripts/script.sh', ['param1'], {
    stdio: [ 'ignore', out, err ], // piping stdout to out.log and stderr to err.log
    detached: true
}).unref();

The unref() call disconnects the child process from the parent; it equates to a disown in a shell.

Thanks to the OP: https://stackoverflow.com/questions/25323703/nodejs-execute-command-in-background-and-forget

Also Ref: https://nodejs.org/api/child_process.html#child_process_child_process_spawn_command_args_options

and

https://github.com/nodejs/node-v0.x-archive/issues/9255

nJoy 😉

Identify OS on remote host

For nmap to even make a guess, it needs to find at least one open and one closed port on the remote host. Using the previous scan results, let us find out more about the host 192.168.0.115:

# nmap -O -sV 192.168.0.115

Output:

Starting Nmap 7.80 ( https://nmap.org ) at 2020-10-02 12:21 CEST
Nmap scan report for 192.168.0.115
Host is up (0.00023s latency).
Not shown: 991 closed ports
PORT      STATE SERVICE     VERSION
22/tcp    open  ssh         OpenSSH 5.1 (protocol 2.0)
80/tcp    open  http        Apache httpd 2.2.19 ((Unix) mod_ssl/2.2.19 OpenSSL/0.9.8zf DAV/2)
111/tcp   open  rpcbind     2 (RPC #100000)
139/tcp   open  netbios-ssn Samba smbd 3.X - 4.X (workgroup: WORKGROUP)
443/tcp   open  ssl/http    Apache httpd 2.2.19 ((Unix) mod_ssl/2.2.19 OpenSSL/0.9.8zf DAV/2)
445/tcp   open  netbios-ssn Samba smbd 3.X - 4.X (workgroup: WORKGROUP)
873/tcp   open  rsync       (protocol version 29)
2049/tcp  open  nfs         2-4 (RPC #100003)
49152/tcp open  upnp        Portable SDK for UPnP devices 1.6.9 (Linux 2.6.39.3; UPnP 1.0)
MAC Address: 00:26:2D:06:39:DB (Wistron)
Device type: general purpose
Running: Linux 2.6.X|3.X
OS CPE: cpe:/o:linux:linux_kernel:2.6 cpe:/o:linux:linux_kernel:3
OS details: Linux 2.6.38 - 3.0
Network Distance: 1 hop
Service Info: OS: Linux; CPE: cpe:/o:linux:linux_kernel:2.6.39.3


OS and Service detection performed. Please report any incorrect results at https://nmap.org/submit/ .
Nmap done: 1 IP address (1 host up) scanned in 14.58 seconds

nJoy 😉

How to quit ESXi SSH and leave background tasks running

In Linux, when a console session is closed, most background jobs (^Z and bg %n) stop running, because the parent (the SSH session) sends a SIGHUP to all its children when it closes properly. Some programs can catch and ignore the SIGHUP, or not handle it at all and thus pass to the root init parent. The disown command in a shell removes a background job from the list of jobs that receive a SIGHUP.

In ESXi there is no disown command. However, there is a way to close a shell immediately without issuing the SIGHUPs:

exec </dev/null >/dev/null 2>/dev/null

exec normally replaces the current shell with the given command; with no command, as here, it simply applies the redirections, detaching the shell's stdin, stdout and stderr from the session.

nJoy 😉