Franka Vision-Guided Fine Manipulation

Unconstrained Millimeter Bogie Alignment

Andnet DeBoer, Derek Dietz, Theo Coulson

Northwestern University

ROS 2 Python Franka

Overview

This project demonstrates precise fine manipulation using a Franka Emika robot arm to manipulate HO-scale model train cars with ±1 mm accuracy. The system integrates a robust computer vision pipeline with MoveIt2 motion planning to solve a challenging alignment task: positioning free-spinning train bogies onto model railroad tracks.

The project also establishes a zero-shot data distillation pipeline for training custom object detection models, using the robot itself to autonomously collect and generate training data.

Problem Statement

Aligning model train cars onto tracks requires millimeter-level precision, and the task is complicated by the unconstrained rotation of the bogies: the wheel assemblies can swivel freely in any direction when the train is lifted, similar to caster wheels. Traditional pick-and-place approaches fail because:

Bogie orientation is unknown when the gripper approaches the train
Track orientation varies across the layout and must be detected in real-time
Trains and tracks look similar from a top-down view, which makes them hard for vision systems to tell apart

Solution

End Effector

Our solution uses a custom end effector to physically constrain the bogie to a known rotation, combined with a robust OpenCV pipeline to detect track orientation. The gripper then aligns the constrained wheel assembly with the detected track angle before placement.
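The final alignment step can be illustrated with a small angle computation. This is a minimal sketch, not the project's code: the function name `shortest_yaw_correction` is hypothetical, and it assumes that both the constrained bogie and the track behave as undirected lines, so a 180-degree flip is treated as an equivalent alignment.

```python
import math


def shortest_yaw_correction(gripper_yaw: float, track_angle: float) -> float:
    """Return the smallest rotation (radians) that aligns the gripper with the track.

    Assumption: a track is an undirected line, so angles that differ by pi
    are equivalent. The result is wrapped into [-pi/2, pi/2).
    """
    delta = track_angle - gripper_yaw
    # Wrap modulo pi (not 2*pi) because of the assumed 180-degree symmetry.
    return (delta + math.pi / 2) % math.pi - math.pi / 2


if __name__ == "__main__":
    # Gripper at 10 degrees, track detected at 170 degrees:
    # rotating by -20 degrees is shorter than rotating by +160 degrees.
    correction = shortest_yaw_correction(math.radians(10), math.radians(170))
    print(round(math.degrees(correction), 3))
```

Wrapping modulo pi keeps the wrist rotation small, which may help stay clear of joint limits; if the end effector is not symmetric, the wrap should be modulo 2*pi instead.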

System Architecture

                        ┌─────────────┐     ┌─────────────────┐     ┌──────────────────┐
                        │  RealSense  │────▶│  Vision System  │────▶│  Conductor Node  │
                        │  Camera     │     │  (Track + Car)  │     │                  │
                        └─────────────┘     └─────────────────┘     └────────┬─────────┘
                                                                            │
                                                  Target Poses + Gripper States
                                                                            ▼
                        ┌─────────────┐     ┌─────────────────┐     ┌──────────────────┐
                        │  Franka Arm │◀────│  MoveIt2 API    │◀────│       Railer     │
                        │             │     │                 │     │                  │
                        └─────────────┘     └─────────────────┘     └──────────────────┘

Computer Vision Pipeline

Track Detection

A multi-stage OpenCV pipeline processes RGB images from the RealSense camera to detect track orientation:

Preprocessing: Brightness, contrast, and white balance adjustment

Edge Detection: Canny edge detection on enhanced images

Morphological Operations: Dilation and skeletonization to extract rail centerlines

Line Detection: Hough transform to identify track segments

Pose Estimation: Convert 2D track orientation to 3D transforms using depth data
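One way to turn the detected line segments into a single track orientation is a length-weighted average of segment angles. The sketch below is an assumption about how this step could work, not the project's implementation: `dominant_track_angle` is a hypothetical helper, and the segments use the `(x1, y1, x2, y2)` pixel layout that `cv2.HoughLinesP` returns. Angles are doubled before averaging so that a segment and its reverse count as the same direction.

```python
import math
from typing import Iterable, Tuple

Segment = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def dominant_track_angle(segments: Iterable[Segment]) -> float:
    """Length-weighted mean orientation of line segments, in radians in [0, pi).

    Segment direction is ambiguous (a line has no head or tail), so each
    angle is doubled before averaging and the mean is halved afterwards.
    """
    sum_cos = 0.0
    sum_sin = 0.0
    for x1, y1, x2, y2 in segments:
        dx, dy = x2 - x1, y2 - y1
        length = math.hypot(dx, dy)
        if length == 0.0:
            continue  # skip degenerate segments
        theta = math.atan2(dy, dx)
        # Longer segments contribute more to the estimate.
        sum_cos += length * math.cos(2.0 * theta)
        sum_sin += length * math.sin(2.0 * theta)
    if sum_cos == 0.0 and sum_sin == 0.0:
        raise ValueError("no usable segments")
    return (0.5 * math.atan2(sum_sin, sum_cos)) % math.pi


if __name__ == "__main__":
    # Two rails at 45 degrees plus one segment drawn in the opposite direction.
    rails = [(0, 0, 10, 10), (5, 5, 0, 0), (0, 1, 10, 11)]
    print(math.degrees(dominant_track_angle(rails)))
```

The angle is measured in the image frame, where the y axis points down; converting it to a yaw in the robot frame depends on the camera mounting and is not shown here.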

Figure: OpenCV pipeline stages for rail centerline detection

Figure: Rail centerline detection result

Train Detection & Classification

Zero-Shot Data Distillation Pipeline

Stage	Method	Output
Data Collection	Franka conical scans of each train car → ROS bags	~30,000 RGB-D frames
Frame Extraction	Every 10th frame sampled	~3,000 images
Auto-Labeling	Grounding DINO + SAM2	Bounding boxes (~70% accurate)
Manual Refinement	Human correction	Clean training labels
Model Training	YOLOv8-OBB	Oriented bounding box detection

Training Challenges

The vision system required adversarial training to handle edge cases:

Tracks misclassified as trains (similar dark, elongated shapes)
Trains misclassified as tracks (especially from top-down view)
Significant visual similarity between classes when viewed from above

Model Architecture

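from ultralytics import YOLO

# Start from the pretrained YOLOv8-nano oriented-bounding-box checkpoint
model = YOLO('yolov8n-obb.pt')

results = model.train(
    data='dataset.yaml',      # dataset config: image paths and class names
    epochs=60,
    imgsz=640,
    batch=16,
    device=0,                 # first GPU
    name='augmented_model',
    # Augmentation settings
    mosaic=1.0,
    copy_paste=0.4,
    degrees=10,
    translate=0.1,
    scale=0.5,
    shear=2,
)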

Results

mAP50: 0.95+
mAP50-95: 0.85+
Precision: 0.92+
Recall: 0.90+

Train Car Classes

The system recognizes 12 distinct train car types and 2 switches.

Key Features

Oriented Bounding Boxes (OBB)

Standard axis-aligned bounding boxes are insufficient for rotated objects. We use oriented bounding boxes that include rotation angle, enabling:

More accurate object localization
Direct extraction of train orientation for gripper alignment
Better handling of diagonal track sections

import cv2

# Fit an oriented bounding box to a detection contour; angle is in degrees
center, (width, height), angle = cv2.minAreaRect(contour)
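The angle returned by `cv2.minAreaRect` refers to the edge reported as the width, which is not necessarily the long side of the box, and its range has differed between OpenCV versions. A common normalization, sketched below under that assumption, converts the result into the orientation of the long axis in [0, 180) degrees. The helper name `long_axis_angle_deg` is hypothetical and is not taken from the project.

```python
from typing import Tuple

# Same layout as the tuple returned by cv2.minAreaRect:
# ((center_x, center_y), (width, height), angle_in_degrees)
RotatedRect = Tuple[Tuple[float, float], Tuple[float, float], float]


def long_axis_angle_deg(rect: RotatedRect) -> float:
    """Return the orientation of the rectangle's long axis in [0, 180) degrees."""
    (_cx, _cy), (width, height), angle = rect
    if width < height:
        # The reported angle follows the short side; rotate by 90 degrees
        # so that it follows the long side instead.
        angle += 90.0
    # An axis has no direction, so fold the result into half a turn.
    return angle % 180.0


if __name__ == "__main__":
    print(long_axis_angle_deg(((0.0, 0.0), (20.0, 100.0), 30.0)))
```

With this normalization, a train car yields the same orientation regardless of which side the fitting routine labels as the width.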

Rail Rejection

To prevent false positives where track sections are detected as trains:

Aspect ratio filtering (trains have characteristic length/width ratios)
Context-aware rejection (objects on tracks vs. beside tracks)
Multi-frame temporal consistency
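Two of these filters can be sketched in a few lines. This is an illustration only: the ratio bounds, window size, and hit count are placeholder values rather than the project's tuned thresholds, `plausible_train_shape` and `TemporalVote` are hypothetical names, and context-aware rejection is not covered.

```python
from collections import deque
from typing import Deque


def plausible_train_shape(length_px: float, width_px: float,
                          min_ratio: float = 3.0, max_ratio: float = 8.0) -> bool:
    """Accept a box only if its length/width ratio falls inside the given bounds.

    The default bounds are placeholders; a long run of rail would typically
    produce a much larger ratio than a single car.
    """
    if length_px <= 0.0 or width_px <= 0.0:
        return False
    long_side = max(length_px, width_px)
    short_side = min(length_px, width_px)
    return min_ratio <= long_side / short_side <= max_ratio


class TemporalVote:
    """Confirm a detection only after it appears in enough recent frames."""

    def __init__(self, window: int = 5, min_hits: int = 4) -> None:
        self._history: Deque[bool] = deque(maxlen=window)
        self._min_hits = min_hits

    def update(self, detected: bool) -> bool:
        """Record this frame's result and return whether the detection is confirmed."""
        self._history.append(detected)
        return sum(self._history) >= self._min_hits


if __name__ == "__main__":
    vote = TemporalVote()
    for frame_box in [(120.0, 25.0)] * 4:
        confirmed = vote.update(plausible_train_shape(*frame_box))
    print(confirmed)
```

Voting over a sliding window trades a few frames of latency for stability, which is usually acceptable when the scene is static while the arm plans its motion.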

Train Centering

Precise centroid calculation using SAM2 segmentation masks:

Generate instance segmentation mask
Calculate mask centroid
Project to 3D using depth alignment
Publish as TF transform for motion planning
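The centroid and projection steps can be sketched with the standard pinhole camera model. The code below makes several assumptions: the mask is a plain 2D grid of 0/1 values (a real pipeline would more likely use NumPy arrays), the depth image is already aligned to the color image, lens distortion is ignored, and the intrinsics in the usage example are made-up values rather than a real D435 calibration. Publishing the result as a TF transform is not shown.

```python
from typing import Sequence, Tuple


def mask_centroid(mask: Sequence[Sequence[int]]) -> Tuple[float, float]:
    """Return the (u, v) pixel centroid of the nonzero cells of a binary mask."""
    count = 0
    sum_u = 0
    sum_v = 0
    for v, row in enumerate(mask):        # v indexes rows (image y)
        for u, value in enumerate(row):   # u indexes columns (image x)
            if value:
                count += 1
                sum_u += u
                sum_v += v
    if count == 0:
        raise ValueError("mask is empty")
    return sum_u / count, sum_v / count


def deproject(u: float, v: float, depth_m: float,
              fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float, float]:
    """Convert a pixel and its depth (meters) into a 3D point in the camera frame.

    Uses the pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return x, y, depth_m


if __name__ == "__main__":
    mask = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
    ]
    u, v = mask_centroid(mask)
    # Hypothetical intrinsics, for illustration only.
    print(deproject(u, v, 0.5, fx=600.0, fy=600.0, cx=320.0, cy=240.0))
```

The resulting point is expressed in the camera's optical frame; transforming it into the robot base frame requires the camera-to-robot calibration.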

Hardware

Robot: Franka Emika Panda 7-DOF arm
Camera: Intel RealSense D435 (RGB + Depth)
End Effector: Custom 3D-printed gripper with bogie constraint mechanism
Trains: HO-scale (1:87) model railroad cars

Software Stack

ROS 2 Kilted
MoveIt2 - Motion planning
OpenCV - Image processing
Ultralytics YOLOv8 - Object detection
Grounding DINO - Open-vocabulary detection
SAM2 - Instance segmentation