ScrubData: Local Data Cleaning Plan

Discover ScrubData, a hands-off data cleaning tool using a small AI model to generate reversible, explained cleaning plans. See how it masks sensitive data and handles ambiguities, ensuring trustworthy data transformation.

Qwen3-4B-Instruct-2507 llama Ollama Gradio OpenTelemetry-GenAI

Video

Overview

ScrubData is a hands-off data-cleaning tool where a 4B model that runs on your laptop does the work — drop a messy spreadsheet, get clean data back with every change named, reversible, and explained, and anything sensitive masked on-device.

The key idea: the small model is a planner, never a row-by-row editor. Pandas profiles each column into a value→frequency distribution (scale-invariant — a million rows profile like a hundred), the model emits a structured JSON cleaning plan, canonical forms are grounded against reference taxonomies (GeoNames, ISO, ~100k entities) with abstention on ties, every mapping is verified by deterministic evidence, and only then does pandas execute. Nothing is silent.

Live, I’ll show: a real messy CSV cleaned end-to-end; the “YOUR CALL” cards where the model abstains on genuine ties (Slovia → Slovakia 86% vs Slovenia 86%) and hands you the decision; the audit grid of named, reversible edits; PII flagged and masked locally; the OpenTelemetry traces from real runs; and the eval harness. Repo, model, dataset, and traces are all public.

Links

https://huggingface.co/spaces/build-small-hackathon/scrubdata
This Hugging Face Space automatically cleans uploaded CSV/Excel files.
https://github.com/ricalanis/scrubdata-hackathon
ScrubData uses a local, fine-tuned Qwen3-4B LLM to clean data.

Tech stack

Qwen3-4B-Instruct-2507

An ultra-efficient 4-billion parameter language model optimized for rapid, non-thinking instruction following and 256K long-context reasoning.

Alibaba's Qwen3-4B-Instruct-2507 delivers high-tier performance in a compact 4.0-billion parameter footprint. Operating in a dedicated non-thinking mode (bypassing slow reasoning blocks to output answers immediately), this model excels at instruction following, code generation, and multilingual tasks across 100+ languages. Its standout feature is a massive 256K token context window, allowing developers to process entire codebases or dense documents locally on consumer-grade hardware without sacrificing speed or accuracy.

https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507

View projects
llama

Meta's open-weights LLM family optimized for high-performance local deployment and custom fine-tuning across 8B to 405B parameter scales.

Llama 3.1 delivers state-of-the-art performance through a flagship 405B parameter model trained on 15 trillion tokens. It supports a 128k context window: ideal for analyzing massive datasets or long-form documentation. Developers utilize Llama for diverse tasks (multilingual translation, Python code generation, and complex reasoning) while maintaining data sovereignty via local hosting. The ecosystem includes the Llama Stack for agentic workflows and optimized weights for 8B and 70B models, ensuring high throughput on consumer hardware or enterprise clusters.

https://llama.meta.com/

View projects
Ollama

Deploy and run open-source Large Language Models (LLMs) like Llama 3 and Mistral locally on your machine: achieve private, cost-effective AI via a simple command-line interface.

Ollama is the essential tool for running LLMs locally: consider it the Docker for AI models. It packages complex models and dependencies into a single, easy-to-use application for macOS, Linux, and Windows systems. You get immediate access to models like Gemma 2 and DeepSeek-R1 via a straightforward CLI or REST API. This local-first approach guarantees data privacy and security, eliminating cloud dependency and high API costs. Ollama also optimizes performance on consumer hardware using techniques like quantization, ensuring efficient execution even on standard desktops.

https://ollama.com

View projects
Gradio

Gradio is the open-source Python library for rapidly building and sharing interactive web UIs for any machine learning model or Python function.

Gradio is the essential tool for data scientists and ML engineers: it turns any Python function (including TensorFlow, PyTorch, and Hugging Face models) into a live, interactive web application with just a few lines of code. This open-source library eliminates the need for complex frontend development, handling all HTML, CSS, and JavaScript automatically. Developers define the function and specify inputs (e.g., 'text', 'image', 'slider') and outputs, then launch the interface locally, embed it in a notebook, or instantly generate a shareable public link. Gradio is widely adopted for quick prototyping, model demonstration, and deployment on platforms like Hugging Face Spaces, making complex models accessible to non-technical users for testing and feedback.

https://www.gradio.app/

View projects
OpenTelemetry-GenAI

OpenTelemetry-GenAI standardizes observability for generative AI applications by establishing a unified schema for tracking prompts, completions, token usage, and agent workflows.

As engineering teams integrate large language models into production, tracking performance requires more than basic HTTP metrics. OpenTelemetry-GenAI solves this by defining standard semantic conventions (specifically starting with version 1.37) to capture critical LLM metadata. The framework instruments client libraries to record input and output token counts, model names, prompt and completion payloads, and complex agent tool calls. By standardizing this telemetry, developers can export structured traces and metrics directly to their existing observability pipelines (such as Datadog or Honeycomb) without maintaining parallel, vendor-specific SDKs.

https://opentelemetry.io/docs/specs/semconv/gen-ai/

View projects