Stop Building AI Skyscrapers on a Data Swamp
Clean, connected data beats complex models. Start with naming, structure, and a simple lake→warehouse pipeline.
December 14, 2025 · By Yi Li, a battery engineer
The battery industry's current discourse focuses almost exclusively on the flashy stuff: breakthrough solid-state chemistries, the promise of generative AI designing the next supercell, and groundbreaking machine learning models that can predict battery pack performance after 10 years with only one month of cycling data.
This is misguided. Instead, I like to look at our industry through Charlie Munger's principle of inversion: looking at a problem backwards. What we tend to ask is, "How do we use AI to predict battery behavior?" What we should really be asking is, "What is currently preventing us from knowing anything for certain about our battery packs?"
Right now, the main problem in the battery industry isn't a lack of algorithms. It's poor data infrastructure, which is often chaotic, fragmented, and, for the most part, ignored.
💡 We're trying to build AI skyscrapers on a data swamp.
Our industry needs to stop pretending our data is good enough. We need to focus instead on how to improve it and extract value from it through basic data management. I'll show that this can be done without spending a fortune or getting in bed with third parties who will invariably charge one for their pie-in-the-sky products.
The unspoken gaps
The communication gap
In the battery world, collaboration across teams can be challenging. One place this shows up is between management and engineering. It isn't simply a matter of different working styles; it reflects a genuine communication gap, with each group approaching problems from a different perspective.
a. The Clash: Deterministic Expectations vs. Probabilistic Reality
Leadership naturally craves results. They want robust predictions about pack durability to make business decisions. They ask binary questions: "Will this pack last 10 years? Yes or No?" This is an understandable desire for certainty. However, they often view the modeling process as a commodity—like pushing out a software update that IT can sort out in a week.
Engineers, on the other hand, know that physics doesn't answer in "Yes or No." We answer in confidence intervals and probabilities. When management asks for a guarantee, engineers hear a request for the impossible.
b. The Data Iceberg: The Invisible 70%
Management pays for the tip; engineers drown in the rest.
This mentality ignores the reality of the work. Management usually sees only the final result, the "flashy graph" in a PPT. They remain unaware of the Data Iceberg. For a battery engineer, ~70% (or even more) of the time is not spent modeling; it is spent wrestling with the inputs: collecting data from multiple sources, filtering noise, aligning timestamps from different sensors, and patching messy datasets.
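To make the iceberg concrete, here is a minimal sketch of just one of those invisible chores: aligning two sensor streams logged at different rates. It uses pandas; the file names, column names, and the 5-second tolerance are illustrative assumptions, not anyone's production pipeline.

```python
# One of the "invisible" chores: aligning two sensor streams sampled at
# different rates. File and column names are hypothetical.
import pandas as pd

# Pack voltage logged at 1 Hz, temperature logged at 0.2 Hz (assumed)
voltage = pd.read_csv("pack_voltage.csv", parse_dates=["timestamp"])
temperature = pd.read_csv("pack_temperature.csv", parse_dates=["timestamp"])

# merge_asof requires both frames sorted on the join key
voltage = voltage.sort_values("timestamp")
temperature = temperature.sort_values("timestamp")

# Nearest-neighbour join within a 5 s tolerance; anything further apart
# stays NaN and has to be patched or dropped explicitly.
aligned = pd.merge_asof(
    voltage,
    temperature,
    on="timestamp",
    direction="nearest",
    tolerance=pd.Timedelta("5s"),
)

# A quick look at how much of the iceberg is simply missing data
print(aligned.isna().mean())
```

None of this will ever appear on a slide, yet every downstream model inherits whatever decisions get made here.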
c. The Rise of "PowerPoint Engineering"
This invisibility creates a perverse incentive structure. If the "grind" is invisible, and success is judged solely on the final presentation, we stop optimizing for engineering truth and start optimizing for "Executive Presence."
We are seeing the rise of PowerPoint Engineering. In this culture, the engineer who spends weeks ensuring the data is mathematically sound often loses out to the engineer who spends their time adjusting fonts, colors, and animations. The "winner" is not the one with the robust model, but the one with the persuasive narrative.
d. The Joint Responsibility
We cannot fix this by simply blaming "the other side." Both parties are currently too ignorant to talk to each other effectively.
- Management's Duty: They must stop treating simulation as a black box. They need to learn to ask, "What is the quality of the input data?" rather than just "What is the result?"
- Engineering's Duty: We have a responsibility to educate. We often fail to articulate why the data cleaning takes 70% of the time.
The cell and pack manufacturing data gap
A battery pack is not just a box of cells. When a pack fails in the field:
- Was it the cell chemistry?
- Was it a welding defect during module assembly?
- Was it a failed BMS?
There's a natural IP and commercial-sensitivity barrier between cell makers and pack integrators. Once cells leave the factory, the performance data rarely flows back upstream.
*Figure: The broken feedback loop between battery pack manufacturing and field performance data.*
In the absence of this bridge, your highly skilled battery engineers stop being engineers and become firefighters. Without a digital thread tracing the end-of-line data from the cell manufacturer to the process data of the pack assembler to the on-field data collected by BMS, we are doomed to relive the same process again and again.
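What would that digital thread look like in its crudest form? A hedged sketch: three tables joined on serial numbers. The Parquet files and column names below are assumptions for illustration, not any manufacturer's actual schema.

```python
# Crude "digital thread": join cell end-of-line data, pack process data,
# and field BMS data on serial numbers. Column names are assumptions.
import pandas as pd

cell_eol = pd.read_parquet("cell_eol.parquet")          # cell_sn, capacity_Ah, dcir_mohm
pack_process = pd.read_parquet("pack_process.parquet")  # pack_sn, cell_sn, weld_resistance_uohm
field_bms = pd.read_parquet("field_bms.parquet")        # pack_sn, cycle_count, min_cell_voltage_V

thread = (
    cell_eol
    .merge(pack_process, on="cell_sn", how="inner")
    .merge(field_bms, on="pack_sn", how="inner")
)

# A field anomaly can now be cross-referenced against the cell's
# end-of-line data and its weld record instead of guessed at.
suspect = thread[thread["min_cell_voltage_V"] < 2.5]
print(suspect[["pack_sn", "cell_sn", "weld_resistance_uohm", "dcir_mohm"]])
```

The join itself is trivial; the hard part is the commercial agreement that lets those three tables exist in the same place.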
The map is not the territory
The Disconnect: Academic Research vs. the Real World
Polish-American philosopher Alfred Korzybski made a simple observation: people often confuse a representation with the real thing. A map shows the territory, but it isn't the territory itself.
- Academia: clean labs, constant conditions, flawless datasets.
- Industry: heat, cold, vibration, noisy signals, missing data.
⚠️ The hard truth: A simple linear regression on clean data beats a deep neural network on messy data every single time.
Taming the Chaos: The cheap(ish) Fix
After six years working across cells, packs, and data systems, I have come to a realization: Battery engineering is not just an electrochemistry problem—it is a systems problem in disguise.
High-Level Architecture: The Refining Process
The Three-Layer Data Model
- Layer 1: The Wild West (Data Lake): Raw, immutable, and messy files (CAN logs, .ssn, etc.).
- Layer 2: Civilization (Data Warehouse): Clean, standardized .mat or .parquet files. The single source of truth.
- Layer 3: The Treasure Chest (Database): Processed results, summary tables, and insights for management.
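As a minimal sketch of the Layer 1 → Layer 2 hop, the snippet below promotes a raw, already-decoded log from the lake into a standardized Parquet file in the warehouse. The paths, canonical columns, and units are illustrative assumptions.

```python
# Layer 1 -> Layer 2: promote a raw, already-decoded log from the lake
# into a standardized Parquet file in the warehouse. Paths, columns,
# and units are illustrative assumptions.
from pathlib import Path
import pandas as pd

LAKE = Path("data_lake/raw")        # Layer 1: immutable, never edited
WAREHOUSE = Path("data_warehouse")  # Layer 2: the single source of truth

# One agreed-upon schema for every test, whatever the logger produced
CANONICAL_COLUMNS = {
    "time_s": "float64",
    "pack_voltage_V": "float64",
    "pack_current_A": "float64",
    "max_cell_temp_C": "float64",
}

def promote_to_warehouse(raw_csv: Path) -> Path:
    df = pd.read_csv(raw_csv)
    df = df.rename(columns=str.strip)  # strip stray whitespace in headers
    df = df[list(CANONICAL_COLUMNS)].astype(CANONICAL_COLUMNS)
    WAREHOUSE.mkdir(parents=True, exist_ok=True)
    out = WAREHOUSE / raw_csv.with_suffix(".parquet").name
    df.to_parquet(out, index=False)  # the raw file in the lake stays untouched
    return out

for raw in LAKE.glob("*.csv"):
    promote_to_warehouse(raw)
```

The design choice that matters is immutability: Layer 1 is never edited in place, so any cleaning decision can be revisited and rerun.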
The "Excel vs. SQL" philosophy
Instead of over-engineering a database, we can embrace a rigorous folder structure and naming conventions:
- The Folder is the Schema: Rigorous hierarchy.
- Convention over Configuration: The filename is the database record (e.g., Project_TestType_Date_PackID).
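If the filename is the record, a few lines of Python can stand in for the query engine. Only the Project_TestType_Date_PackID pattern comes from the convention above; the regex, date format, and folder layout are illustrative assumptions.

```python
# Treat the filename as the database record. The regex and folder layout
# are illustrative assumptions.
import re
from pathlib import Path
import pandas as pd

PATTERN = re.compile(
    r"(?P<project>[^_]+)_(?P<test_type>[^_]+)_(?P<date>\d{8})_(?P<pack_id>[^_.]+)$"
)

def index_warehouse(root: str) -> pd.DataFrame:
    records = []
    for path in Path(root).rglob("*.parquet"):
        match = PATTERN.match(path.stem)
        if match is None:
            print(f"Naming violation, skipping: {path.name}")
            continue
        records.append({**match.groupdict(), "path": str(path)})
    return pd.DataFrame(records)

# The folder-and-filename equivalent of SELECT * WHERE test_type = 'Cycling',
# with zero database administration.
catalog = index_warehouse("data_warehouse")
cycling_tests = catalog[catalog["test_type"] == "Cycling"]
```

The convention only pays off if violations are caught loudly, which is why the indexer flags every file that does not match.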
Pipeline Implementation: The Core Stages
- Stage 0: Discovery: Metadata indexing.
- Stage 1: Ingestion: Normalization to a unified format.
- Stage 2: Analytics: Core processing decoupled from loading.
- Stage 3: Reporting: Conversion of math into business intelligence.
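One way to keep those stages honest is to make each one a plain function with a narrow contract, so analytics never touches raw files and reporting never touches math. The sketch below is a hypothetical wiring of the four stages; the function names and the capacity metric are assumptions, not a prescribed implementation.

```python
# Each stage is a plain function with a narrow input/output contract.
# Function names and the capacity metric are hypothetical.
from pathlib import Path
import pandas as pd

def discover(lake: Path) -> pd.DataFrame:
    """Stage 0: index what exists; touch nothing."""
    return pd.DataFrame({"path": [str(p) for p in lake.rglob("*.csv")]})

def ingest(raw_path: str) -> pd.DataFrame:
    """Stage 1: normalize one raw file toward the canonical schema."""
    return pd.read_csv(raw_path).rename(columns=str.strip)

def analyze(df: pd.DataFrame) -> dict:
    """Stage 2: core processing; no file I/O allowed in here."""
    return {"end_capacity_Ah": df["capacity_Ah"].iloc[-1]}

def report(results: list[dict], out: Path) -> None:
    """Stage 3: turn the math into a table management will actually read."""
    out.parent.mkdir(parents=True, exist_ok=True)
    pd.DataFrame(results).to_csv(out, index=False)

if __name__ == "__main__":
    catalog = discover(Path("data_lake/raw"))
    results = [analyze(ingest(p)) for p in catalog["path"]]
    report(results, Path("reports/summary.csv"))
```

The point is not these particular functions but the decoupling: because the analytics stage only ever sees a DataFrame, the same code runs whether the data came from a cycler export or a field download.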
Outlook
The battery industry is facing a regulatory tidal wave, spearheaded by initiatives like the EU Battery Passport. Its strict traceability requirements will force us to build the very "Digital Thread" our engineers have been begging for.
For the last ten years, the race was defined by electrochemistry. But the winners of the next decade won't just be the ones with the highest energy density; they will be the ones who have mastered the unglamorous, disciplined work of Data Management.
Chemistry gets you into the game, but data keeps you there.
#battery-engineering #data-management #AI #big-data #battery-industry