acolomitchi edited this page Dec 18, 2016 · 6 revisions

How it looks at a glance

(screenshot: application overview)

  1. leftmost pane: the distribution(s) of distances:
    • in green - the "baseline" distribution, taken from 4000 points uniform-randomly distributed in the unit square; by trial, the distances between 4000 points seem to result in a smooth-enough-for-the-purpose distribution
    • in blue, the distribution of distances between the "experimental" points (the ones in the middle pane)
    • in red, the difference between the "experimental" distribution and the "baseline" one
    • underneath, two (thin) progress bars showing the completion percentage of their respective distributions. A change in the "experimental" point set triggers re-computation of the "experimental" distribution, with the progress shown by the blue progress bar. The same goes for the "baseline" one, except that the "baseline" point set changes only in special circumstances (*maybe* explained later).
  2. mid pane - the "experimental point cloud" positioning/distorting area. Note the presence of the "distortion hull" for the light-blue cluster of points in the lower right: dragging its corners distorts the hull and, implicitly, the positions of the points in the "cluster under edit"; use the mid-side knobs on the hull to rotate it around the barycentre, and the barycentre itself to translate the cluster to another position. (Careful with the last one: if you drop it outside the editing square, you won't be able to bring it back again - maybe in the future I'll look into this; if you did drop your cluster that way, the solution is to delete and recreate it.)
  3. right pane - the "point collection editor" - here you can add/remove point clusters and specify their number/attributes, each with a corresponding "Adjust" checkbox - when ticked, this checkbox makes the "distortion hull" visible around the corresponding cluster in the mid pane

If time allows (and there is any interest from others), <<THIS will be replaced by a link to the user guide>>

Some notes

How many distances are enough?

The distribution of distances between two random points in a box (not necessarily the unit box) is known in closed form up to three dimensions, while at 4 and 5 dimensions (and, I assume, higher) it isn't.
Since the base code should work in higher dimensions without change, I went down the path of generating the "baseline" distribution the same way as the rest: by "experimentally" extracting the distance distribution of a uniform-random distribution of points within the (potentially n-dim) box.
But then... how many points in the "baseline point cluster" would be enough? Actually, the better question is "how many randomly chosen distances would result in a histogram close enough to the theoretical one?" Now, for the purpose of "exploring", deriving an exact value (based on a "close enough" precision) would be overkill, so I relied on the "eye test" - how many distances to consider so that the distance distribution looks smooth enough. Around 16·10⁶ seems to be a maximal value for "good enough" (at least for 2D), while 4·10⁶ is acceptable. The application lets the user choose whether to compute the exhaustive set of all distances or only a certain count of them - see the checkbox/spinbox group in the upper-right.

Below, the screen captures for the exhaustive set vs. 16·10⁶ sampled distances, for the roughly 24,000 points (in 12 clusters) shown in the mid-pane:

  • All distances considered

    exhaustive set of distances

  • Only 16M distances considered for the distribution

    16M random distances

Examples of distributions

When I started this toy project, I sort of knew that with 2 clusters the distribution of distances was going to show a peak corresponding to the intra-cluster distances and another one corresponding to the inter-cluster distances, but I couldn't imagine exactly what the distance distribution would show in between, or how it would look for more than 2 clusters. This was the very purpose of putting together this application. For any others curious enough to be interested, but not to the point of compiling/running the app, some screen captures:

  • no clusters
  • 1 cluster, ND (normally distributed)
  • 2 clusters, ND
  • 3 clusters, ND
  • 4 clusters, ND
  • 5 clusters, ND
  • 6 clusters, ND
  • 7 clusters, ND

Some words on the design

I wanted some code I could reuse in other (than Qt/2D) contexts, just in case the exploration turns up something interesting and I need to explore it further in non-GUI conditions.
The src/model directory contains the (templated) classes to do it. Notable ones:

  1. the src/model/model.hpp classes:
  • npoint<CoordType, Dim> - in src/model/model.hpp. An N-dim point with coordinates expressed as CoordType; in fact, just a typedef for Eigen::template Matrix<C, 1, DIM, Eigen::RowMajor> (vector arithmetic already available, plus some other interesting ops I'll mention later - if I don't forget)
  • npoint_grp<C, DIM> and npoint_grp<C, DIM, Transform> - two templated classes to support the concept of a "cluster of points". The first one is just storage, similar to std::vector<npoint<...>>; the second one offers support for "transformation on-the-fly" (it stores the points as a vector but, when asked for one, offers a transformed copy of the requested point)
  • bound_npoint_cloud<Supplier, CoordType, DIM> - a pure virtual collection of "point clusters" (with their type designated by the Supplier template parameter), with the (pure virtual) within_bounds method to tell whether any point supplied by the stored Supplier instances is acceptable. The only way to obtain the points is by copy, using the void points_copy(Container& dest) const method.
    Most of the time, the Suppliers managed by this class will be instances of npoint_grp<C, DIM, ...> (or derived), but any other class implementing the requirements (briefly) documented in the class comments will simply do the job (compile-time polymorphism)
  • bbox_npoint_cloud<Supplier, CoordType, DIM> - just a bound_npoint_cloud<Supplier, CoordType, DIM> with the bounds defined by a rectangular N-dim box;
  • histogram<CoordType> - an abstract class modelling a histogram, with the derived fixedl_histogram<CoordType> implementing support for histograms with lower/upper bounds known at creation time (fixed-limits histogram);
  2. src/model/dists.hpp - ah, yes, another example of compile-time polymorphism. Apart from the l2 distance (providing the Euclidean-norm distance) there is also support for the Mahalanobis distance. The main difference between the two: while l2 is independent of any context to do its computations, the mahalanobis class needs to be initialized with the reference set of points to do the job, with the void update(const Supplier* src) method needing to be called before operator()(npoint..., npoint...) can return a proper distance. As l2 does not provide an update(const Supplier* src) method (not even a do-nothing one), the job of "updating" a distance is left to… (see next point)
  3. the src/model/proc.hpp classes
  • … the detail::dist_type_updater<SupplierType, DistType> struct is the class that handles (at compile time) the update(const SupplierType* src) action of updating the DistType, based on the presence or absence of the update(...) method. The code that needs the services of a "distance" class only needs to call one of dist_type_updater::update_dist(const Supplier&, dist_type&) or dist_type_updater::update_dist(const Supplier*, dist_type&) - if the DistType class declares the update(const SupplierType* src) method, it will be invoked appropriately; if not, the call into dist_type_updater::update_dist results in a no-op (kicked out at any non-trivial code optimisation level);
  • the histogram_filler<typename CoordType, size_t DIM, class PointSupplier, class Observer> class does the heavy lifting of filling in a histogram on a separate thread. The class provides a start(...) method (which kicks off the job of filling in a histogram) and a stop() method, which makes sure the worker thread is stopped before returning (if the job of filling the histogram is not yet complete). As, during the histogram filling, the caller may want to be notified of the progress and have partial results provided, start(...) takes an observer parameter. The details are a bit more complicated, so maybe I'll cover them separately later.
  4. the src/view classes - the majority of them just organize the GUI flow. The interesting ones are:
  • file view/2d.hpp - the qtrn_adaptor, a class that packs/adapts a QTransform in a way npoint_grp<C, DIM, Transform> can consume. It is how the "distortion hull" in the "clusters positioning/distortion area" (mid pane of the GUI) acts on the points of a cluster
  • files view/PointCluster[.hpp/.cpp] - the PointCluster class, which wraps/adapts an npoint_grp<C, DIM, Transform> to Qt's slots/signals
  • files view/cloudmodel[.hpp/.cpp] - the CloudModel class, which adapts a bbox_npoint_cloud<Supplier, CoordType, DIM> to Qt's slots/signals. It also acts as the "Model" of the GUI part (the M in MVC).