How We Built an AI That Understands Warehouses

When we started building OneTrack, the conventional wisdom was that general-purpose computer vision models would work for warehouses. Take a model trained on ImageNet, fine-tune it on some forklift images, ship it. That approach failed almost immediately.

Warehouses are not like the datasets these models were trained on. The lighting is harsh and inconsistent. The cameras are mounted on moving equipment, not tripods. The objects you need to detect are industrial equipment covered in grease, not crisp consumer products on white backgrounds. And the behaviors you need to recognize happen over sequences of frames, not single snapshots.

This is the story of how we built computer vision systems that actually work in warehouse environments. Not a marketing overview. A technical explanation of the architecture, the training data, and the hard problems we had to solve.

The Training Data Problem

Every machine learning engineer knows the mantra: your model is only as good as your training data. For warehouses, this creates a fundamental challenge. The public datasets that power most computer vision research do not contain forklifts, reach trucks, order pickers, pedestrians in safety vests, pallet jacks, racking systems, or dock doors. ImageNet has 1,000 object categories. None of them are "operator using phone while driving a forklift."

We had to build our training dataset from scratch. As of today, our models have been trained on tens of millions of video frames captured from real warehouse operations across more than 150 facilities.

The scale matters, but diversity matters more. A model trained on one warehouse learns that warehouse. It learns the specific lighting conditions, the specific equipment colors, the specific racking layout. Deploy it somewhere else and performance degrades.

Our training data spans:

Lighting conditions. Warehouses range from well-lit modern facilities with LED high bays to older buildings with sodium vapor lights casting everything in orange. Some have natural light from skylights that changes throughout the day. Some have dark corners where shadows dominate. Freezer warehouses have condensation on lenses. Loading docks flip between interior lighting and blinding outdoor sun as doors open.

Equipment diversity. The market has dozens of forklift manufacturers and hundreds of models. Counterbalance trucks, reach trucks, order pickers, turret trucks, pallet jacks. Colors range from brand-specific yellows and greens to custom paint jobs. Our models need to recognize "forklift" regardless of whether it is a 2005 Hyster or a 2024 Toyota.

Facility layouts. Dense storage facilities look different from cross-dock operations. Narrow aisle warehouses have different traffic patterns than bulk storage. Food and beverage facilities run pallets differently than automotive parts distribution. The visual environment varies dramatically.

Shift patterns. Activity during peak hours looks different from overnight maintenance windows. Shift changes create congestion patterns that do not exist at 3 AM. Seasonal spikes change how densely packed the aisles become.

Building this dataset required deploying sensors, labeling millions of events, and continuously expanding coverage as we encountered new environments. This is not a one-time investment. Every new deployment adds training data that improves the models for everyone in the network.

Embedding Representations: Beyond Pixel Matching

Early computer vision systems worked by matching pixels. Take a template image, slide it across the input, look for correlation. This approach is brittle. Change the lighting slightly and correlation drops. Rotate the object a few degrees and it fails to match.

Modern approaches work differently. Instead of comparing pixels directly, we convert images into embeddings: high-dimensional vector representations that capture semantic meaning rather than surface appearance.

Think of an embedding as a fingerprint for concepts rather than pixels. A forklift photographed from the front and a forklift photographed from the side look completely different at the pixel level. But their embeddings are close together in vector space because they represent the same underlying concept.

This is how our models generalize across sites. The system does not memorize what forklifts look like at Site A and then fail at Site B because the equipment is different. It learns the abstract concept of "forklift" and recognizes it regardless of specific appearance.

The technical approach combines convolutional feature extractors with attention mechanisms that learn which parts of an image matter for classification. The output is a dense vector, typically 512 to 2048 dimensions, that represents the semantic content of the input.
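To make the shape of this concrete, here is a minimal sketch of such an encoder in PyTorch. The layer sizes, the attention pooling, and the 512-dimensional output are illustrative stand-ins, not our production architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingEncoder(nn.Module):
    """Toy image encoder: conv features + attention pooling -> dense embedding."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        # Small convolutional feature extractor (stand-in for a real backbone).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # Attention scores over spatial locations: which regions matter.
        self.attn = nn.Conv2d(128, 1, kernel_size=1)
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(images)                                  # (B, 128, H', W')
        weights = torch.softmax(self.attn(feats).flatten(2), dim=-1)   # (B, 1, H'*W')
        pooled = (feats.flatten(2) * weights).sum(-1)                  # (B, 128) weighted pool
        emb = self.proj(pooled)
        return F.normalize(emb, dim=-1)                                # unit-length embedding

# Two crops of the same forklift should land close together in this space.
encoder = EmbeddingEncoder()
a, b = torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224)
print(F.cosine_similarity(encoder(a), encoder(b)))
```

The important property is the last line of the forward pass: embeddings are unit vectors, so cosine similarity between two crops directly measures how close they are in concept space.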

For warehouse applications, we train these embeddings on domain-specific tasks. Not "is this a dog or a cat" but "is this operator looking in the direction of travel" or "is there a phone in the operator's hand." The embedding space is optimized for the distinctions that matter in industrial environments.

One concrete example: distinguishing a phone from a scanner. Both are handheld devices. Both have screens. At the pixel level, they can look similar. But in our embedding space, they are well-separated because we trained on thousands of labeled examples of each. The model learns the subtle visual differences, the way operators hold them, and the context in which they appear.
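To illustrate how that separation gets used, here is one simple way to classify a new crop against class prototypes, the mean embedding of labeled phone and scanner examples. A sketch only; the prototype approach and class IDs below are illustrative, not a description of our deployed classifier.

```python
import torch
import torch.nn.functional as F

def class_prototypes(embeddings: torch.Tensor, labels: torch.Tensor) -> dict:
    """Mean (then re-normalized) embedding per class, e.g. 'phone' vs 'scanner'."""
    protos = {}
    for c in labels.unique():
        protos[int(c)] = F.normalize(embeddings[labels == c].mean(0), dim=0)
    return protos

def classify(query: torch.Tensor, protos: dict) -> int:
    """Assign the query crop to the nearest prototype by cosine similarity."""
    sims = {c: float(query @ p) for c, p in protos.items()}  # unit vectors: dot = cosine
    return max(sims, key=sims.get)

# Hypothetical usage with precomputed, L2-normalized embeddings:
# protos = class_prototypes(train_embs, train_labels)   # 0 = scanner, 1 = phone
# label = classify(encoder(crop)[0], protos)
```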

From Object Detection to Behavior Recognition

Detecting objects in individual frames was table stakes. The hard problem was recognizing behaviors that unfold over time.

Consider a near-miss between a forklift and a pedestrian. In any single frame, you might see a forklift and a person in the same image. Is that a near-miss or normal traffic? You cannot tell from one frame. You need the sequence. Was the forklift moving toward the pedestrian? Did they get dangerously close? Did either party take evasive action?

Behavior recognition requires temporal models: systems that process sequences of frames and understand how the scene evolves over time.

Our architecture uses a combination of approaches. Per-frame feature extraction identifies what objects are present and where. Tracking algorithms follow objects across frames, maintaining identity even when they temporarily occlude each other. Sequence models, including temporal convolutional networks and transformer-based attention over frame sequences, learn to classify behaviors based on how objects move relative to each other over time.
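A minimal sketch of the sequence-model piece, assuming per-frame embeddings have already been computed: a transformer encoder attends across frames and a linear head classifies the clip. The behavior classes and dimensions below are placeholders, not our production configuration.

```python
import torch
import torch.nn as nn

class BehaviorClassifier(nn.Module):
    """Toy sequence model: attention over per-frame embeddings -> behavior label."""

    def __init__(self, embed_dim: int = 512, num_behaviors: int = 4, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(embed_dim, num_behaviors)

    def forward(self, frame_embeddings: torch.Tensor) -> torch.Tensor:
        # frame_embeddings: (batch, num_frames, embed_dim), one vector per frame.
        ctx = self.temporal(frame_embeddings)   # frames attend to each other
        return self.head(ctx.mean(dim=1))       # pool over time -> class logits

# A 3-second clip at 10 fps becomes a sequence of 30 frame embeddings.
clip = torch.randn(1, 30, 512)
logits = BehaviorClassifier()(clip)   # e.g. [normal, phone_use, not_looking, near_miss]
```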

Some examples of behaviors our models recognize:

Phone use while operating. The model detects a phone-shaped object in the operator's hand, tracks it across frames, and classifies whether the operator is interacting with it versus holding it passively or putting it away.

Not looking in the direction of travel. This requires tracking the operator's head orientation relative to the forklift's direction of motion. The model learns to recognize when the operator is looking sideways or backward while the vehicle moves forward.

Intersection behavior. Did the operator actually stop and look at a blind corner, or did they roll through without checking? The model tracks vehicle speed through the intersection and head orientation of the operator.

Distinguishing impacts from normal contact. Forklifts touch things constantly. Forks slide under pallets. Loads make contact with racking. The model learns to distinguish intentional contact from unexpected impacts based on velocity, deceleration patterns, and subsequent operator behavior.

Each of these requires understanding sequences, not snapshots. A phone in someone's hand is not necessarily a violation. They might be putting it in their pocket. The model has to see what happens over the next few seconds to classify correctly.
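For intuition about why the sequence matters, here is a deliberately oversimplified near-miss check over two tracked trajectories. The distance and closing-speed thresholds are invented for illustration; in practice the decision boundary is learned from labeled events rather than hard-coded.

```python
import numpy as np

def near_miss(forklift_xy: np.ndarray, pedestrian_xy: np.ndarray,
              fps: float = 10.0, dist_m: float = 1.5, closing_mps: float = 1.0) -> bool:
    """Flag a clip if the two tracks get close while the gap is shrinking fast.

    forklift_xy, pedestrian_xy: (num_frames, 2) floor-plane positions in meters.
    The 1.5 m and 1.0 m/s thresholds are illustrative only.
    """
    gap = np.linalg.norm(forklift_xy - pedestrian_xy, axis=1)   # distance per frame
    closing_speed = -np.diff(gap) * fps                         # m/s the gap shrinks
    too_close = gap[1:] < dist_m
    closing_fast = closing_speed > closing_mps
    return bool(np.any(too_close & closing_fast))
```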

Site Adaptation and Transfer Learning

No matter how diverse your training data, a new site will have something you have not seen before. A unique equipment type. An unusual lighting configuration. A layout that creates visual patterns different from anything in your dataset.

This is where transfer learning becomes essential. Instead of training a new model from scratch for each site, we transfer knowledge from the network to new deployments.

The process works like this:

When we deploy to a new site, the model starts with all the knowledge accumulated from the network. It already knows what forklifts look like across dozens of manufacturers. It already understands pedestrian behaviors. It already recognizes phone use, directional awareness, intersection patterns.

During a calibration period, the system runs in observation mode. It processes video, generates classifications, and learns the specific characteristics of this environment. Which zones have unusual lighting. What equipment types are present. How traffic flows through this particular layout.

The calibration is not supervised learning in the traditional sense. We are not collecting thousands of new labels. Instead, we use techniques like domain adaptation and few-shot learning to adjust the model's decision boundaries for the new environment while preserving the core capabilities learned from the network.
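A minimal sketch of what that kind of adaptation can look like, assuming a shared encoder and a small site-specific classification head. The loop below illustrates the freeze-and-adapt pattern, not our exact calibration pipeline.

```python
import torch
import torch.nn as nn

def adapt_to_site(encoder: nn.Module, head: nn.Module,
                  calibration_batches, epochs: int = 3, lr: float = 1e-4):
    """Few-shot style adaptation: keep network-wide knowledge, tune a small head."""
    for p in encoder.parameters():
        p.requires_grad = False          # preserve what the network already learned
    encoder.eval()

    optimizer = torch.optim.Adam(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in calibration_batches:   # small, site-specific sample
            with torch.no_grad():
                emb = encoder(images)
            loss = loss_fn(head(emb), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return head
```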

This is the fundamental difference between a customer trying to build their own system versus using ours. A customer starting from scratch has zero training data. They would need to collect and label millions of frames before their model worked reliably. We start from a foundation of tens of millions of labeled frames and adapt to new environments with minimal additional data.

The network effects compound over time. Every site we deploy improves the foundation models. When Site 151 encounters an unusual situation, that data can improve performance for Sites 1 through 150 as well. The models get better for everyone as the network grows.

Edge Architecture: Why Processing Location Matters

A common question: why not just stream video to the cloud and process it there? Cloud GPUs are powerful and scalable. Why bother with edge processing?

The answer comes down to latency, bandwidth, and the fundamental requirements of safety systems.

Latency. A safety alert that arrives 30 seconds after a near-miss is a report, not a warning. If we want to enable real-time operator feedback, we need sub-second inference. That means processing has to happen locally. Round-trip latency to cloud data centers, even with good connectivity, adds hundreds of milliseconds to seconds of delay. For safety applications, that is too slow.

Bandwidth. Video is expensive to transmit. A single HD camera generates gigabytes per day. Multiply by multiple cameras per device, multiple devices per site, and you quickly exceed what most warehouse networks can handle. Edge processing reduces bandwidth requirements by orders of magnitude. Instead of streaming raw video, we transmit only events and metadata.

Reliability. Warehouse connectivity is not guaranteed. Networks go down. Internet connections get saturated during peak hours. A safety system that fails when the network is unavailable is not acceptable. Edge processing means the system keeps working regardless of connectivity state.
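To put rough numbers on the bandwidth point above, here is a back-of-the-envelope comparison. Every figure below (bitrate, camera count, event size, event volume) is an illustrative assumption, not a measured value.

```python
# Illustrative back-of-the-envelope numbers, not measured figures.
video_mbps = 4                         # a single HD stream at ~4 Mbit/s
cameras = 3                            # cameras per device (assumption)
raw_gb_per_day = video_mbps * cameras * 86_400 / 8 / 1_000   # ~130 GB/day/device

events_per_day = 200                   # detected events per device (assumption)
kb_per_event = 50                      # metadata + thumbnail per event (assumption)
edge_mb_per_day = events_per_day * kb_per_event / 1_000      # ~10 MB/day/device

print(raw_gb_per_day, edge_mb_per_day)  # roughly four orders of magnitude apart
```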

Our sensors include Neural Processing Units (NPUs), dedicated silicon optimized for machine learning inference. These chips can run our detection and classification models at frame rate while consuming minimal power.

The architecture splits responsibilities. Computationally intensive tasks that must happen in real-time run on the edge. Training, model updates, analytics aggregation, and cross-site comparisons happen in the cloud. The edge devices receive updated model weights periodically, incorporating improvements from across the network.
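In pseudocode terms, the edge side of that split looks roughly like the loop below. All of the interfaces (camera, model, uplink, weight_store) are hypothetical placeholders, not a real SDK.

```python
import time

def edge_loop(camera, model, uplink, weight_store, update_every_s: int = 86_400):
    """Sketch of the edge split: infer locally, send events only, refresh weights.

    camera, model, uplink, and weight_store stand in for device-specific
    interfaces; none of these names come from an actual API.
    """
    last_update = time.monotonic()
    while True:
        frame = camera.read()                      # raw video never leaves the device
        detections = model.infer(frame)            # NPU-accelerated, per-frame
        for event in detections.safety_events():   # e.g. near-miss, phone use
            uplink.send(event.metadata())          # small metadata payload, not video

        if time.monotonic() - last_update > update_every_s:
            weights = weight_store.fetch_latest()  # improvements from across the network
            if weights is not None:
                model.load(weights)
            last_update = time.monotonic()
```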

This hybrid approach gives us the best of both worlds. Real-time local processing for safety-critical functions. Cloud-scale compute for the heavy lifting of model training and improvement.

The Network Effect and Data Flywheel

Here is the competitive reality that makes this business different from typical software.

Most software products have network effects in usage. More users make the product more valuable through direct interactions. Slack gets better when more of your colleagues use it.

AI products have network effects in data. More deployments generate more training data. More training data produces better models. Better models create more value for customers. More value attracts more customers. The flywheel accelerates.

When we started, our models had limited capability. They could detect basic objects but struggled with edge cases. As we deployed to more sites, we collected more labeled examples. The models improved. They could handle more complex behaviors, more diverse environments, more unusual situations.

Today, with tens of millions of labeled frames from over 150 sites, our models can recognize situations that a new entrant has never seen. A startup trying to build competing capability would need years to accumulate equivalent training data, assuming they could even get deployed at enough sites.

This is not just about volume. It is about coverage of the distribution. We have seen forklifts from every major manufacturer. We have seen warehouses in every climate zone. We have seen operations across multiple industries. That breadth means our models are robust to variation in ways that a smaller dataset cannot achieve.

The labeled events are particularly valuable. Any company can point cameras at forklifts. The hard part is knowing which frames contain near-misses, which show phone use, which capture impacts. That labeling requires human review at scale and represents thousands of hours of work. It cannot be replicated overnight.

Trade-offs and Hard Problems

This is not a post claiming we have solved everything. Some problems remain genuinely hard.

Rare events. The behaviors we care most about, serious near-misses and actual injuries, are statistically rare. You cannot train robust classifiers on rare events alone. We use techniques like hard negative mining and data augmentation to address class imbalance, but there is no substitute for eventually collecting enough real examples.
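As a small illustration of the class-imbalance piece, one standard ingredient (alongside hard negative mining and augmentation) is weighting rare positives more heavily in the loss. The counts below are invented for the example.

```python
import torch
import torch.nn as nn

# If serious near-misses are roughly 1 in 1,000 clips, an unweighted loss lets the
# model ignore them. Up-weighting the positive class is one standard counterweight.
num_negatives, num_positives = 999_000, 1_000                 # illustrative counts
pos_weight = torch.tensor([num_negatives / num_positives])    # ~999x

loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# logits, labels = model(clips), clip_labels                  # labels: 1 = near-miss
# loss = loss_fn(logits, labels)
```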

Occlusion and obstruction. Cameras on forklifts see what they see. If a load blocks the view, if another vehicle passes between the camera and an event, we miss it. Multi-camera fusion helps but does not solve the problem completely.

Operator intent. The same action can have different meanings. An operator looking away from the direction of travel might be checking their mirrors (good) or checking their phone (bad). Context matters. We continue improving the models' ability to infer intent from context.

Privacy and acceptance. Cameras in the workplace raise concerns. We have built privacy features into the system, including face blurring and options for local-only processing of certain data. But adoption requires building trust with operators that the system is about safety coaching, not surveillance.

Model interpretability. Neural networks are famously black boxes. When the model makes a classification, it is not always clear why. We use attention visualization and other interpretability techniques to help operators and supervisors understand what the system saw, but this remains an active area of development.
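As one example of the kind of tooling this involves, a simple gradient-based saliency map highlights which pixels most influenced a classification. This is a generic technique sketch, not our production visualizer.

```python
import torch

def saliency_map(model, image: torch.Tensor, target_class: int) -> torch.Tensor:
    """Vanilla gradient saliency: which pixels most influence the predicted class."""
    image = image.clone().requires_grad_(True)     # image: (3, H, W)
    score = model(image.unsqueeze(0))[0, target_class]
    score.backward()
    # Max gradient magnitude across color channels -> one heat value per pixel.
    return image.grad.abs().max(dim=0).values
```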

These are engineering challenges, not fundamental blockers. We improve on each of them with every model generation. But they represent honest areas where the technology has room to grow.

What This Enables

The technical foundation we have built enables capabilities that were not possible before.

Automatic detection of safety events across every forklift, every shift, without human review. Leading indicator tracking that predicts which behaviors will cause incidents. Video-based coaching that shows operators exactly what they did and why it matters.

These are not incremental improvements to existing approaches. They represent a different way of understanding warehouse operations. Ground truth data captured automatically, at scale, continuously.

For the first time, operations leaders can see their facilities with the same clarity they see their transactional systems. Not sampled data. Not self-reported data. Actual data about what happens on the floor.

That visibility changes everything. It enables proactive safety management instead of reactive investigation. It enables coaching based on evidence instead of guesswork. It enables continuous improvement driven by measurement instead of intuition.

The AI is the foundation. What you build on top of it is what matters.

If you want to see how this works in practice, request a demo or read how customers are using the platform.

