You don’t need to become a data engineer. You need to think like one for about twenty minutes.
The Mindset Before the Tools
The emerging skill isn’t “learn SQL” or “master a BI platform.” It’s learning to think about your data the way a data engineer would — then letting LLMs handle the implementation.
Here’s the analogy: when you renovate a house, you don’t need to know how to frame a wall or run electrical wiring. But you absolutely need to know what rooms you want, where the doors should go, and how the spaces flow together. You’re the architect. The contractor (AI) swings the hammer.
Same with data. You don’t need to write SQL queries from scratch or configure database indexes. But you need to understand:
- What data you have and what it means
- How pieces of data relate to each other
- What questions you want to ask of your data
- What “good” data looks like vs. what’s garbage
This module teaches the architectural thinking. The tools are secondary.
Schema Design Thinking
What Is a Schema?
A schema is the blueprint for how your data is organized. It defines: what kinds of things you’re tracking (entities), what you know about each thing (attributes), and how things relate to each other (relationships).
You already do this informally. If you have a spreadsheet with client info, invoices, and projects — those are entities. The columns are attributes. The fact that an invoice belongs to a client is a relationship.
The difference between a spreadsheet and a proper schema is explicitness. In a spreadsheet, relationships live in your head (“the client name in the invoice tab matches the name in the clients tab”). In a schema, they’re defined formally, which means an AI (or any tool) can navigate them correctly.
The Three Questions
When you’re thinking about any dataset, answer three questions:
1. What are the nouns?
These become your tables (or entities). Look for distinct things you’re tracking:
- Clients, Projects, Invoices, Payments, Tasks, Time Entries, Expenses
Each noun gets its own table. If you find yourself putting wildly different kinds of information in the same table, you probably have two nouns hiding in one.
2. What do you know about each noun?
These become your columns (or attributes):
- Client: name, email, phone, company, industry, notes, created_date
- Invoice: number, client (which one?), date_issued, date_due, status, total
A good test: could this attribute apply to a different noun? If “company name” appears in both your Client table and your Invoice table, you probably want it defined once (in Client) and referenced from Invoice.
3. How do the nouns relate to each other?
These become your relationships:
- A Client has many Projects (one-to-many)
- A Project has many Invoices (one-to-many)
- An Invoice has many Line Items (one-to-many)
- A Time Entry belongs to one Project and one Task (many-to-one)
The word “many” is key. When you say “a Client has many Invoices,” you’re saying the relationship goes one direction — one client, multiple invoices. When you say “a Project has many Team Members and a Team Member works on many Projects,” that’s a many-to-many relationship, which needs a bridge table.
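To make the bridge-table idea concrete, here's a minimal sketch using Python's built-in sqlite3 module. The table and column names (`project_members`, `role`, the sample people) are hypothetical, invented for illustration:

```python
import sqlite3

# In-memory database for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE team_members (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

-- The bridge table: one row per (project, member) pairing.
-- The two foreign keys together form the primary key, so the same
-- person can't be added to the same project twice.
CREATE TABLE project_members (
    project_id INTEGER NOT NULL REFERENCES projects(id),
    member_id  INTEGER NOT NULL REFERENCES team_members(id),
    role       TEXT,
    PRIMARY KEY (project_id, member_id)
);
""")
conn.executemany("INSERT INTO projects VALUES (?, ?)", [(1, "Rebrand"), (2, "Audit")])
conn.executemany("INSERT INTO team_members VALUES (?, ?)", [(1, "Ana"), (2, "Ben")])
conn.executemany(
    "INSERT INTO project_members VALUES (?, ?, ?)",
    [(1, 1, "lead"), (1, 2, "design"), (2, 1, "lead")],  # Ana works on both projects
)

# "Who works on the Rebrand?" walks through the bridge table.
rows = conn.execute("""
    SELECT tm.name, pm.role
    FROM team_members tm
    JOIN project_members pm ON pm.member_id = tm.id
    JOIN projects p ON p.id = pm.project_id
    WHERE p.name = 'Rebrand'
    ORDER BY tm.name
""").fetchall()
print(rows)  # [('Ana', 'lead'), ('Ben', 'design')]
```

Notice the bridge table holds nothing but the two foreign keys plus attributes of the relationship itself (here, `role`). That's the signature of a many-to-many done right.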
Worked Example: Freelance Consulting Data
Let’s say you’re a freelance consultant. You track clients, projects, hours, invoices, and expenses across various spreadsheets, email threads, and your accounting tool. Here’s how to think about the schema:
Entities:
clients → The people/companies you work for
projects → Engagements with specific scope and timeline
time_entries → Hours logged to specific projects
invoices → Bills sent to clients
invoice_items → Line items on each invoice
expenses → Costs incurred for projects
payments → Money received against invoices
Relationships:
clients ──< projects (one client, many projects)
projects ──< time_entries (one project, many time entries)
projects ──< expenses (one project, many expenses)
clients ──< invoices (one client, many invoices)
invoices ──< invoice_items (one invoice, many line items)
invoices ──< payments (one invoice, many payments)
The schema (SQLite):
CREATE TABLE clients (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT NOT NULL,
email TEXT,
company TEXT,
industry TEXT,
hourly_rate REAL,
notes TEXT,
created_at TEXT DEFAULT (datetime('now'))
);
CREATE TABLE projects (
id INTEGER PRIMARY KEY AUTOINCREMENT,
client_id INTEGER NOT NULL REFERENCES clients(id),
name TEXT NOT NULL,
status TEXT DEFAULT 'active' CHECK(status IN ('active','completed','paused','cancelled')),
start_date TEXT,
end_date TEXT,
budget REAL,
notes TEXT
);
CREATE TABLE time_entries (
id INTEGER PRIMARY KEY AUTOINCREMENT,
project_id INTEGER NOT NULL REFERENCES projects(id),
date TEXT NOT NULL,
hours REAL NOT NULL CHECK(hours > 0),
description TEXT,
billable INTEGER DEFAULT 1
);
CREATE TABLE invoices (
id INTEGER PRIMARY KEY AUTOINCREMENT,
client_id INTEGER NOT NULL REFERENCES clients(id),
invoice_number TEXT UNIQUE NOT NULL,
date_issued TEXT NOT NULL,
date_due TEXT NOT NULL,
status TEXT DEFAULT 'draft' CHECK(status IN ('draft','sent','paid','overdue','cancelled')),
notes TEXT
);
CREATE TABLE invoice_items (
id INTEGER PRIMARY KEY AUTOINCREMENT,
invoice_id INTEGER NOT NULL REFERENCES invoices(id) ON DELETE CASCADE,
description TEXT NOT NULL,
quantity REAL NOT NULL,
unit_price REAL NOT NULL,
amount REAL GENERATED ALWAYS AS (quantity * unit_price) STORED
);
CREATE TABLE expenses (
id INTEGER PRIMARY KEY AUTOINCREMENT,
project_id INTEGER REFERENCES projects(id),
date TEXT NOT NULL,
category TEXT NOT NULL,
amount REAL NOT NULL,
description TEXT,
receipt_path TEXT
);
CREATE TABLE payments (
id INTEGER PRIMARY KEY AUTOINCREMENT,
invoice_id INTEGER NOT NULL REFERENCES invoices(id),
date TEXT NOT NULL,
amount REAL NOT NULL,
method TEXT,
reference TEXT
);
You don’t need to write this SQL yourself. The point is understanding the structure well enough to describe it to Claude: “I have clients who have projects. Each project has time entries and expenses. I invoice clients and track payments against invoices.” Claude writes the SQL. You verify the structure matches your mental model.
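One wrinkle worth knowing when you (or Claude) build this in SQLite: the `REFERENCES` clauses above are not enforced unless you enable foreign keys on each connection. A minimal sketch, using a trimmed two-table subset of the schema above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path like "consulting.db" for real data
# SQLite ignores REFERENCES clauses unless you opt in, per connection:
conn.execute("PRAGMA foreign_keys = ON")

# Two tables from the schema above, trimmed to the essentials.
conn.executescript("""
CREATE TABLE clients (
    id   INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL
);
CREATE TABLE projects (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    client_id INTEGER NOT NULL REFERENCES clients(id),
    name      TEXT NOT NULL
);
""")

conn.execute("INSERT INTO clients (name) VALUES ('Acme Co')")
conn.execute("INSERT INTO projects (client_id, name) VALUES (1, 'Rebrand')")  # valid reference

# With the pragma on, a project pointing at a missing client is rejected:
fk_error = None
try:
    conn.execute("INSERT INTO projects (client_id, name) VALUES (99, 'Ghost')")
except sqlite3.IntegrityError as e:
    fk_error = e
    print("rejected:", e)  # rejected: FOREIGN KEY constraint failed
```

Without the pragma, that second insert would silently succeed and you'd have an orphaned project. It's exactly the kind of detail worth asking Claude to double-check in any setup script it writes.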
Data Quality (Or: Garbage In, Confident Garbage Out)
Structural vs. Semantic Validation
Structural validation checks format: Is this a valid email address? Is this date in the right format? Is this number positive? These checks are mechanical, and tools handle them automatically.
Semantic validation checks meaning: Does it make sense for a client to have an hourly rate of $0.01? Should an invoice be dated in the future? Can a time entry have more than 24 hours? These require understanding what the data means, not just what it looks like.
AI is good at structural validation. Semantic validation requires you to define the rules, because they depend on your business context. A $0.01 hourly rate might be a data entry error — or it might be a pro bono engagement. Only you know.
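In practice, semantic rules end up as a small library of queries you define once and rerun. A sketch of what that could look like; the thresholds (24 hours, future-dated invoices) are example business judgments, not universal rules:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_entries (id INTEGER PRIMARY KEY, date TEXT, hours REAL);
CREATE TABLE invoices (id INTEGER PRIMARY KEY, date_issued TEXT);
""")
conn.executemany("INSERT INTO time_entries (date, hours) VALUES (?, ?)",
                 [("2024-03-01", 6.0), ("2024-03-02", 26.0)])     # 26h: typo, or two days merged?
conn.execute("INSERT INTO invoices (date_issued) VALUES ('2999-01-01')")  # future-dated

# The SQL is trivial; the thresholds are business judgments only you can make.
RULES = {
    "time entry over 24h": "SELECT id FROM time_entries WHERE hours > 24",
    "invoice dated in the future": "SELECT id FROM invoices WHERE date_issued > date('now')",
}

flagged = {label: [r[0] for r in conn.execute(sql)] for label, sql in RULES.items()}
for label, ids in flagged.items():
    if ids:
        print(f"CHECK: {label} -> rows {ids}")
```

The point isn't the code, which Claude can write in seconds. It's that each rule encodes a judgment ("more than 24 hours in a day is suspect for me") that only you can supply.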
The “Would I Trust This?” Test
Before querying any dataset for business decisions, ask: “If I pulled one random row from this table, would I trust it?”
If the answer is “probably” — you’re fine for exploration and rough analysis. If the answer is “maybe” — clean the data first, or at minimum know which columns are reliable and which aren’t. If the answer is “no” — stop. AI will happily generate beautiful charts from terrible data. It doesn’t flinch. It doesn’t squint at a number and think “that can’t be right.” That’s your job.
Common Data Quality Issues (And What to Do)
Duplicates: The same client entered twice with slightly different names. Solve by defining a unique identifier (email is usually better than name).
Missing values: Some entries have fields left blank. Decide: is blank meaningful (e.g., “no phone number on file”) or is it missing data? Handle them differently.
Inconsistent formats: “United States,” “US,” “U.S.A.” in the same column. Standardize during import, not during queries.
Stale data: Information that was correct when entered but isn’t anymore. Client changed email, project ended but status wasn’t updated. Regular review cadence matters.
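The first three issues can all be handled at import time. A sketch of the pattern, using a hypothetical CSV export with a duplicate client and inconsistent country values; the column names and mapping table are illustrative:

```python
import csv
import io
import sqlite3

# Hypothetical raw export with two classic problems: the same client twice
# (same email, different name spellings) and inconsistent country values.
raw_csv = """name,email,country
Acme Corp,billing@acme.com,United States
ACME corp.,billing@acme.com,US
Nordwind,hello@nordwind.de,Germany
"""

COUNTRY_MAP = {"us": "United States", "u.s.a.": "United States",
               "united states": "United States", "germany": "Germany"}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clients (email TEXT PRIMARY KEY, name TEXT, country TEXT)")

for row in csv.DictReader(io.StringIO(raw_csv)):
    # Standardize formats during import, not during queries.
    country = COUNTRY_MAP.get(row["country"].strip().lower(), row["country"])
    # Email (not name) is the unique key, so the second "Acme" row
    # updates the existing record instead of creating a duplicate.
    conn.execute(
        "INSERT INTO clients VALUES (:email, :name, :country) "
        "ON CONFLICT(email) DO UPDATE SET name = excluded.name, country = excluded.country",
        {"email": row["email"].strip().lower(), "name": row["name"].strip(), "country": country},
    )

cleaned = conn.execute("SELECT email, country FROM clients ORDER BY email").fetchall()
print(cleaned)  # [('billing@acme.com', 'United States'), ('hello@nordwind.de', 'Germany')]
```

Stale data is the one issue code can't fix: it requires a review cadence, because nothing in the file tells you the information is out of date.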
Using LLMs as Your Data Layer
For many solo-operator use cases, you don’t need a dedicated BI platform. The industry calls this pattern “agentic data analysis” — the LLM autonomously writes queries, runs them, checks the output, and iterates. It’s exactly what we’re teaching here, just with a fancier name. Here’s the workflow.
Step 1: Gather Your Data
Export from wherever it lives: spreadsheets, accounting tool CSV exports, project management tool exports, email contacts. Get it into files — CSV is the universal format.
Step 2: Describe the Schema to Claude
You don’t need to write SQL. Describe your data in plain English:
“I have three CSV files. clients.csv has client names, emails, companies, and hourly rates. projects.csv has project names, which client they belong to, start and end dates, and budgets. time_entries.csv has dates, hours, project names, and descriptions. I want to be able to answer questions like ‘which clients generated the most revenue last quarter’ and ‘how many hours did I spend on each project this month.’”
Claude will:
- Design the schema
- Write the SQL to create the tables
- Write import scripts for your CSVs
- Handle data cleaning issues it spots
Remember the three-questions framework from earlier in this module? That’s doing half the work here — you’re just translating your answers into a conversation with Claude.
Step 3: Import and Validate
Have Claude build a SQLite database and import your data. SQLite is the right choice here because: it’s a single file (no server to manage), it’s built into Python and most languages, it’s fast enough for any personal dataset, and it’s completely free.
After import, ask Claude to run validation queries: “Show me any clients with duplicate emails,” “Are there time entries with more than 12 hours?”, “Which projects have entries but no client?”
Fix what’s broken before you start querying for insights.
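The third validation question ("entries but no client") is an orphan check: a row whose foreign key points at nothing. Here's the shape of the query Claude would likely write, shown on a tiny hypothetical dataset:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE projects (id INTEGER PRIMARY KEY, client_id INTEGER, name TEXT);
""")
conn.execute("INSERT INTO clients VALUES (1, 'Acme Co')")
conn.executemany("INSERT INTO projects VALUES (?, ?, ?)",
                 [(1, 1, 'Rebrand'), (2, 42, 'Orphaned project')])  # client 42 doesn't exist

# LEFT JOIN keeps every project; WHERE c.id IS NULL keeps only the
# projects whose client_id matched nothing in clients.
orphans = conn.execute("""
    SELECT p.id, p.name
    FROM projects p
    LEFT JOIN clients c ON c.id = p.client_id
    WHERE c.id IS NULL
""").fetchall()
print(orphans)  # [(2, 'Orphaned project')]
```

The same LEFT JOIN / IS NULL pattern works for any parent-child pair in the schema: time entries without projects, payments without invoices, and so on.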
Step 4: Ask Questions
Now you have a structured database. Ask questions conversationally:
- “Which 5 clients generated the most revenue in the last 6 months?”
- “What’s my average hourly rate across all projects?”
- “Show me monthly revenue trends for the last year”
- “Which projects went over budget?”
- “What percentage of my time is billable vs. non-billable?”
Claude writes the SQL, runs it, and returns results. You never touch SQL unless you want to.
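For a sense of what happens behind the curtain, here's one plausible answer to the first question, run against a tiny invented dataset. Note one judgment call buried in it: "revenue" here means money actually received (payments), not amounts invoiced:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE clients  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE invoices (id INTEGER PRIMARY KEY, client_id INTEGER);
CREATE TABLE payments (id INTEGER PRIMARY KEY, invoice_id INTEGER, amount REAL);
""")
conn.executemany("INSERT INTO clients VALUES (?, ?)", [(1, "Acme Co"), (2, "Nordwind")])
conn.executemany("INSERT INTO invoices VALUES (?, ?)", [(1, 1), (2, 1), (3, 2)])
conn.executemany("INSERT INTO payments VALUES (?, ?, ?)",
                 [(1, 1, 5000.0), (2, 2, 3000.0), (3, 3, 4500.0)])

# "Which 5 clients generated the most revenue?"
# Revenue = payments received, summed per client via the invoice chain.
top_clients = conn.execute("""
    SELECT c.name, SUM(p.amount) AS revenue
    FROM payments p
    JOIN invoices i ON i.id = p.invoice_id
    JOIN clients  c ON c.id = i.client_id
    GROUP BY c.id
    ORDER BY revenue DESC
    LIMIT 5
""").fetchall()
print(top_clients)  # [('Acme Co', 8000.0), ('Nordwind', 4500.0)]
```

Whether "revenue" should mean payments received or invoices issued is exactly the kind of ambiguity to resolve in your question to Claude, not in the SQL after the fact.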
Step 5: Visualize
For quick charts: ask Claude to generate a visualization. “Plot monthly revenue as a bar chart” and Claude writes Python/JavaScript that produces the chart.
For ongoing dashboards: this is where dedicated tools earn their place. See below.
When This Approach Works
This “Claude as data layer” approach works well when:
- Your dataset is small enough to explore interactively, roughly under 100K rows (SQLite itself scales far beyond that, but past this point a managed database and dedicated tooling start to pay off)
- You’re asking ad-hoc questions, not building a live dashboard
- The data changes infrequently (weekly or monthly updates, not real-time)
- You’re one person — no need for multi-user access
If you need real-time data, live dashboards, or multi-user access, you need a proper tool. But for personal analytics and business intelligence? This is available right now with zero additional software.
Dashboard Tools (When You Outgrow Conversations)
When you need persistent views of your data that update automatically, conversation-based querying isn’t enough. You need a dashboard tool.
Metabase (Self-Hosted)
What it is: Open-source business intelligence platform. Self-hosted (free) or cloud ($85/mo). Visual query builder + SQL access. Connects to most databases.
Why it’s interesting for you: You can point it at your SQLite database (or Postgres, MySQL, etc.) and build dashboards without writing code. The visual query builder lets you drag-and-drop to create queries, or switch to raw SQL for complex analysis.
Cloud version adds: Metabot AI — a natural language query interface. “Show me revenue by client for Q4” produces a query and chart.
Setup complexity: Self-hosted requires Docker and ~15 minutes. Cloud requires a credit card and ~2 minutes.
Polymer
What it is: Upload a CSV, get an interactive dashboard. No setup, no SQL, no database.
Why it’s interesting: Absolute lowest barrier to entry. If you have a spreadsheet and want a dashboard in 5 minutes, Polymer delivers. It auto-detects data types, suggests visualizations, and lets you filter and pivot interactively.
Limitation: It’s a spreadsheet visualizer, not a database tool. Complex joins across multiple data sources aren’t its strength.
Price: ~$20/month.
Julius AI
What it is: Chat-based data analysis. Upload data, ask questions, get charts and insights.
Why it’s interesting: Similar to “Claude as data layer” but with a dedicated UI, persistent sessions, and built-in visualization. Good for exploration.
Price: Free tier available; paid plans for more features.
Google Looker Studio
What it is: Free dashboard builder from Google. Connects to Google Sheets, BigQuery, and 800+ connectors.
Why it’s interesting: If your data already lives in Google Sheets, Looker Studio adds dashboard visualization at zero cost. The AI enhancements are improving but not yet transformative.
Limitation: The learning curve is steeper than Polymer, and it’s most powerful within the Google ecosystem.
Decision Matrix
| Situation | Best Tool |
|---|---|
| Ad-hoc questions about data files | Claude/Cowork (no extra tool needed) |
| Quick dashboard from a spreadsheet | Polymer (~$20/mo) |
| Chat-based exploration of data | Julius AI (free tier) |
| Persistent dashboards with SQL access | Metabase (free self-hosted) |
| Data already in Google ecosystem | Looker Studio (free) |
| Enterprise BI needs (probably not you) | Tableau, Power BI |
Further Reading
| Resource | Why You’d Read It |
|---|---|
| LLMs Transforming Data Engineering (Medium) | How LLMs are changing what you need to know about data |
| Data Engineering in 2026: What Changes (Gradient Flow) | How data systems are being rebuilt for AI consumption |
| Metabase | If you want to try the self-hosted BI route |
| Polymer | If you want CSV-to-dashboard in 5 minutes |
Previous: Module 06 — AI for Business Tasks | Next: Module 08 — Connective Tissue