Hand controller for Microsoft HoloLens (Master Thesis)


My master's thesis is a proof of concept for using an additional device as a 6DOF controller for the Microsoft HoloLens (Gen 1). In this case I used the Lenovo Phab 2 Pro, an Android phone with Google Tango support. The concept can be adapted for arbitrary multi-device tracking within a local space. The time frame for the thesis was roughly seven months. Below is a short video of the project result.

HoloLens Handcontroller demo.

In the video you first see the setup stage. The HoloLens already has a scan of the room, as it creates the spatial mapping as part of normal operation. The user starts the scanning process on the phone and moves around the room. A computer in the background receives both ongoing room scans and performs the map alignment on the fly. The image in the lower left corner shows the respective room scans and the map alignment process. After gathering enough room coverage, the user acknowledges the map alignment and ends the setup. Both devices are then ready to use. A small demo at the end shows the controller being used to place and move cubes in the room, as well as the phone being used as a lightsaber.

My Thesis in detail


The current interaction concept of the Microsoft HoloLens (Gen 1) has certain flaws. The gaze cursor is unintuitive, and constant use of the tap gesture quickly tires the arm. This led to the idea of using a hand controller for the HoloLens, as is common for VR systems. However, current controller systems are either insufficient or not compatible with the HoloLens. I therefore developed a new tracking system that is easy to set up and as independent as the HoloLens itself.

Tracking System

The tracking system utilizes SLAM (simultaneous localization and mapping). Each device must itself be a SLAM system that tracks its pose within a local space and creates a local spatial map. The local spaces of the devices are aligned using their local 3D maps to create a shared global reference frame. Once every device has a transformation from its local reference frame to the global one, it can transform its pose into a global pose and share it with all other devices. Each receiving device can then apply the inverse of its own transformation to bring the other devices' poses into its local reference frame.
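The two transformation steps can be sketched with 4x4 homogeneous matrices. This is only an illustration under stated assumptions: the helper names, the example numbers, and the choice of y as the vertical axis are hypothetical and not taken from the thesis code.

```python
import math

def matmul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def rigid_inverse(t):
    """Invert a rigid transform [R | p]; the inverse is [R^T | -R^T p]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    p_inv = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [p_inv[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def yaw_translation(yaw_deg, x, y, z):
    """Build a transform: rotation about the vertical (y) axis plus a translation."""
    c, s = math.cos(math.radians(yaw_deg)), math.sin(math.radians(yaw_deg))
    return [[c, 0.0, s, x],
            [0.0, 1.0, 0.0, y],
            [-s, 0.0, c, z],
            [0.0, 0.0, 0.0, 1.0]]

# Hypothetical alignment results: each maps a device's local frame to the global frame.
phone_local_to_global = yaw_translation(90.0, 1.0, 0.0, 2.0)
# The HoloLens frame is the global frame here, so its transform is the identity.
hololens_local_to_global = yaw_translation(0.0, 0.0, 0.0, 0.0)

# Pose of the phone as tracked in the phone's own local frame (made-up values).
phone_pose_local = yaw_translation(0.0, 0.5, 1.2, 0.0)

# Step 1 (on the phone): local pose -> global pose, which is then shared.
phone_pose_global = matmul(phone_local_to_global, phone_pose_local)

# Step 2 (on the HoloLens): global pose -> HoloLens local frame via the inverse.
phone_pose_in_hololens = matmul(rigid_inverse(hololens_local_to_global),
                                phone_pose_global)
```

Because every device only needs its own local-to-global transformation, adding another device does not require pairwise alignments between all participants.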

Map Alignment Process

For the map alignment, the reference frame of one device is defined as the global one. This can be either the first device in the space or, as in my case, the HoloLens. The spatial map of the joining device is aligned to the global one in a setup stage, as shown in the video below.

The alignment process for one of my test cases.

The alignment uses a customized Iterative Closest Point (ICP) algorithm. Rather than waiting for a fully scanned room, the alignment starts right away. The customized algorithm adds consecutive scans to the models between the iteration steps.
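To illustrate the idea of feeding new scans into a running ICP, here is a minimal 2D point-to-point sketch in plain Python. It is not the PCL-based implementation from the thesis: the function names, the fixed number of iterations per scan, and the pairing of one new scan per device per round are simplifying assumptions.

```python
import math

def apply_transform(transform, points):
    """Apply a 2D rigid transform (theta, tx, ty) to a list of (x, y) points."""
    theta, tx, ty = transform
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

def compose(outer, inner):
    """Return the transform that applies `inner` first and `outer` second."""
    (tx, ty), = apply_transform(outer, [(inner[1], inner[2])])
    return (outer[0] + inner[0], tx, ty)

def invert(transform):
    """Return the inverse rigid transform."""
    theta, tx, ty = transform
    (ix, iy), = apply_transform((-theta, 0.0, 0.0), [(-tx, -ty)])
    return (-theta, ix, iy)

def nearest(point, cloud):
    """Brute-force closest point; a real system would use a k-d tree."""
    return min(cloud, key=lambda q: (q[0] - point[0]) ** 2 + (q[1] - point[1]) ** 2)

def best_rigid_step(src, dst):
    """Least-squares rigid transform mapping paired points src[i] onto dst[i]."""
    n = len(src)
    sx, sy = sum(p[0] for p in src) / n, sum(p[1] for p in src) / n
    dx, dy = sum(q[0] for q in dst) / n, sum(q[1] for q in dst) / n
    cross = sum((p[0] - sx) * (q[1] - dy) - (p[1] - sy) * (q[0] - dx)
                for p, q in zip(src, dst))
    dot = sum((p[0] - sx) * (q[0] - dx) + (p[1] - sy) * (q[1] - dy)
              for p, q in zip(src, dst))
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    return (theta, dx - (c * sx - s * sy), dy - (s * sx + c * sy))

def incremental_icp(source_scans, target_scans, initial=(0.0, 0.0, 0.0),
                    iters_per_scan=10):
    """Run ICP while both models keep growing with newly arriving scans."""
    transform = initial
    source, target = [], []
    for src_scan, tgt_scan in zip(source_scans, target_scans):
        source.extend(src_scan)   # grow the joining device's model
        target.extend(tgt_scan)   # grow the global model
        for _ in range(iters_per_scan):
            moved = apply_transform(transform, source)
            matches = [nearest(p, target) for p in moved]
            transform = compose(best_rigid_step(moved, matches), transform)
    return transform

# Toy data: an L-shaped "room" seen by both devices, delivered in two scans.
target_scans = [[(float(x), 0.0) for x in range(5)],
                [(0.0, float(y)) for y in range(1, 4)]]
true_transform = (math.radians(3.0), 0.1, -0.05)
source_scans = [apply_transform(invert(true_transform), scan)
                for scan in target_scans]
estimate = incremental_icp(source_scans, target_scans)
```

In this toy case the offset is small compared to the point spacing, so the nearest-neighbour matches are correct from the start and `estimate` recovers `true_transform`. Real room scans are far less forgiving, which is why a good starting estimate matters.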

The ICP algorithm needs some sort of pre-alignment. Otherwise it will not converge correctly due to the mostly rectangular shape of rooms. For the pre-alignment I intended to use the accelerometer for the gravitational vector and the magnetometer for the north vector. The devices already handle the former automatically. The latter, however, turned out to be an issue on the HoloLens: it neither grants access to its magnetometer nor provides a compass API (this thesis was done before the release of RS4 and its research mode). Instead, the alignment process assumes that the user is holding the hand controller and looking at it through the HoloLens when starting the setup process. It therefore assumes that both devices are facing the same direction and have almost the same position in the room.
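Under this assumption, an initial transformation follows directly from the two device poses at the moment the setup starts. The sketch below works in the ground plane only, since both local frames are already gravity-aligned; the pose layout `(yaw, x, z)`, the rotation sign convention, and the function names are assumptions for illustration.

```python
import math

def initial_alignment(phone_pose, hololens_pose):
    """Estimate the phone-local -> HoloLens-local transform at setup start.

    Each pose is (yaw_rad, x, z) in the device's own gravity-aligned frame.
    Assumes both devices face the same way and sit at roughly the same spot.
    """
    yaw_p, x_p, z_p = phone_pose
    yaw_h, x_h, z_h = hololens_pose
    yaw = yaw_h - yaw_p                      # rotate phone heading onto HoloLens heading
    c, s = math.cos(yaw), math.sin(yaw)
    tx = x_h - (c * x_p - s * z_p)           # move phone position onto HoloLens position
    tz = z_h - (s * x_p + c * z_p)
    return yaw, tx, tz

def to_global(alignment, point):
    """Map a ground-plane point (x, z) from the phone frame into the HoloLens frame."""
    yaw, tx, tz = alignment
    c, s = math.cos(yaw), math.sin(yaw)
    x, z = point
    return (c * x - s * z + tx, s * x + c * z + tz)

# Made-up poses at the moment the user starts the setup.
phone_pose = (math.radians(30.0), 1.0, 2.0)
hololens_pose = (math.radians(120.0), -0.5, 0.3)
alignment = initial_alignment(phone_pose, hololens_pose)
```

The height offset can be handled the same way by subtracting the two vertical positions. The result is only a starting estimate that the ICP refinement then corrects.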

An alternative, which was considered but discarded, is to use a visual image tracking approach, either as a pre-alignment step or for the overall map alignment.

Hardware & Implementation

For the hand controller device, a Lenovo Phab 2 Pro is used. This is a consumer-grade phone for the now discontinued Google Tango platform, which features built-in SLAM. The devices communicate using MQTT over Wi-Fi. Due to the heavy workload of the alignment computation, an additional PC is part of the setup. This PC also hosts the MQTT broker. The map alignment itself is implemented using the Point Cloud Library (PCL). The HoloLens runs a simple Unity app, and the phone runs a simple Android app built on the Tango SDK.
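As an illustration of what a shared pose update could look like on the wire, the sketch below serializes a pose to JSON and derives a per-device topic. The thesis does not specify the message format, so the topic layout, field names, and JSON encoding are purely hypothetical. Publishing the bytes through an MQTT client is left out.

```python
import json

# Hypothetical topic layout: one pose topic per device.
POSE_TOPIC = "devices/{device_id}/pose"

def encode_pose(device_id, position, rotation, timestamp):
    """Return (topic, payload bytes) for a global pose update.

    position: (x, y, z) in metres, rotation: quaternion (x, y, z, w).
    """
    topic = POSE_TOPIC.format(device_id=device_id)
    payload = json.dumps({
        "t": timestamp,
        "position": list(position),
        "rotation": list(rotation),
    })
    return topic, payload.encode("utf-8")

def decode_pose(payload):
    """Parse a payload produced by encode_pose back into tuples."""
    message = json.loads(payload.decode("utf-8"))
    return tuple(message["position"]), tuple(message["rotation"]), message["t"]

topic, payload = encode_pose("phone", (1.0, 1.2, 1.5), (0.0, 0.0, 0.0, 1.0), 12.5)
position, rotation, timestamp = decode_pose(payload)
```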

Due to the proof-of-concept nature and the limited time scope, neither the communication nor the implementation features any optimization; both follow a straightforward approach.

Due to the pre-alignment requirement, the setup process is started manually. Furthermore, the user has to manually acknowledge and end the setup once the alignment has settled sufficiently. The system cannot end the setup process automatically, because it is unable to determine whether enough room coverage has been gathered and whether the alignment is correct.

Evaluation & Conclusion

The basic concept has been proven, although with some shortcomings. The phone tracking wasn’t entirely stable: it drifted and lost tracking during faster movements. Because the two devices have entirely different hardware and SLAM implementations, their resulting maps and positions vary somewhat in scale. Because the ICP algorithm only produces rigid transformations, this difference in map size could introduce additional offsets.

Without any optimizations in place, using the raw maps with their entire point clouds, the implementation achieved an accuracy of 10-20 cm in translation and 1.9-2.9° in rotation.

Several tests were conducted to determine the required time, room coverage and scan resolution for a sufficient room alignment result. In addition, the initial concept of using only a rotational pre-alignment was tested against the current assumption, which provides both a rotational and a translational pre-alignment. Without the translational pre-alignment, the ICP algorithm was unable to align both maps correctly. This indicates that the initial concept of using just the compass orientation wouldn’t have worked and that a translational pre-alignment is a requirement. To test the minimal scan resolution, a voxel grid filter with different voxel sizes was used to decrease the number of points in the point clouds. Tests showed that a room scan resolution of up to 10 cm still produces sufficient alignment results. The required room coverage depends on the actual ICP implementation as well as some other factors. However, tests showed that roughly 50% of the room needs to be covered by the joining device. As for performance, map alignment settled within 20 seconds on a Core i7-4712HQ (launch date Q2’14).
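The voxel grid downsampling used in the resolution tests can be sketched in a few lines: all points that fall into the same cubic cell are replaced by their centroid. This is a plain Python illustration of the technique, not the PCL filter used in the thesis, and the function name and sample points are made up.

```python
import math

def voxel_grid_filter(points, leaf_size):
    """Downsample (x, y, z) points by replacing each voxel's points with their centroid."""
    voxels = {}
    for x, y, z in points:
        # Integer cell index of the voxel this point falls into.
        key = (math.floor(x / leaf_size),
               math.floor(y / leaf_size),
               math.floor(z / leaf_size))
        acc = voxels.setdefault(key, [0.0, 0.0, 0.0, 0])
        acc[0] += x
        acc[1] += y
        acc[2] += z
        acc[3] += 1
    return [(sx / n, sy / n, sz / n) for sx, sy, sz, n in voxels.values()]

# Three points in metres, filtered with a 10 cm voxel size.
cloud = [(0.01, 0.01, 0.01), (0.03, 0.03, 0.03), (0.25, 0.0, 0.0)]
downsampled = voxel_grid_filter(cloud, leaf_size=0.10)
```

The first two points share a voxel and collapse into one centroid, so the filtered cloud contains two points. Larger voxel sizes reduce the point count further at the cost of geometric detail.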


This system is not limited to a single user with multiple devices. It may also be used by multiple users with a wide variety of devices in a local space, including robots, drones and other autonomous agents.

In comparison with direct visual tracking methods, this approach does not require a line of sight between participating devices. Each device only needs enough room coverage for its SLAM and sufficient visual features.

Links & Further Readings

Project on GitHub
