RecSys for MAD: an empirical study


2.1 Experimental setup

As mentioned in our previous post, our intent is to apply these methods to the MAD.

Our real-world testbed is a dataset of about 1.7 million executed trades covering a period of 3 months (from 15-Aug-2019 to 15-Nov-2019). It includes more than 200 distinct subjects dealing in more than 5k distinct securities traded across more than 50 markets. Below is an extract of a few records from the dataset, just to give an idea.

Executed contracts dataset example
SUBJECT	ISIN	QTY	PRICE	COUNTERVALUE	CURR	TIMESTAMP
1007120	JP3672400003	1700.0	648.200	1101940.00	JPY	2019-08-15T00:03:00.061000Z
1039910	AU000000ANO7	579.0	5.300	3068.70	AUD	2019-08-15T00:05:22.805000Z
1039910	AU000000ANO7	392.0	5.300	2077.60	AUD	2019-08-15T00:09:06.733000Z
1044976	AU000000STM0	250000.0	0.030	7500.00	AUD	2019-08-15T00:14:34.038000Z
1044976	AU000000BOE4	100000.0	0.056	5600.00	AUD	2019-08-15T00:14:40.368000Z
1043883	GB00B03MLX29	30.0	25.100	753.00	EUR	2019-08-15T07:00:00.870623Z
1021212	FR0000124141	16.0	21.700	347.20	EUR	2019-08-15T07:00:01.731558Z
1021212	FR0000124141	10.0	21.700	217.00	EUR	2019-08-15T07:00:01.731558Z
1021212	FR0000124141	84.0	21.700	1822.80	EUR	2019-08-15T07:00:01.731558Z
1021212	FR0000124141	338.0	21.700	7334.60	EUR	2019-08-15T07:00:01.731558Z
1021212	FR0000124141	126.0	21.700	2734.20	EUR	2019-08-15T07:00:01.731558Z
1021212	FR0000124141	50.0	21.700	1085.00	EUR	2019-08-15T07:00:01.731558Z
1021212	FR0000124141	95.0	21.700	2061.50	EUR	2019-08-15T07:00:01.731558Z
1021212	FR0000124141	126.0	21.700	2734.20	EUR	2019-08-15T07:00:01.731558Z
1056156	DE000UNSE018	4.0	26.560	106.24	EUR	2019-08-15T07:00:02.730000Z
1043883	FR0000131104	48.0	39.690	1905.12	EUR	2019-08-15T07:00:03.204824Z

The way we imagine the final product working is as follows: a fixed time window of past data is fed into the RecSys so that it can calibrate and build its own representations of USERs and ITEMs. Future operations are then submitted to the calibrated RecSys: for each (USER, ITEM) pair corresponding to an occurring trade, the RecSys returns a score that can be interpreted as an affinity value or, more interestingly for our purpose, as an anomaly score.
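As a minimal sketch of the first step of this workflow, the snippet below splits a few sample records into a calibration window and a set of later trades, then extracts the (USER, ITEM) pairs that would be submitted for scoring. The helper names and the cutoff are hypothetical, the ISIN is assumed as the ITEM, and the RecSys model itself is not shown.

```python
from datetime import datetime, timezone

# (SUBJECT, ISIN, TIMESTAMP) records taken from the sample above.
trades = [
    ("1007120", "JP3672400003", "2019-08-15T00:03:00.061000Z"),
    ("1039910", "AU000000ANO7", "2019-08-15T00:05:22.805000Z"),
    ("1043883", "GB00B03MLX29", "2019-08-15T07:00:00.870623Z"),
    ("1021212", "FR0000124141", "2019-08-15T07:00:01.731558Z"),
]


def parse_ts(text):
    """Parse the dataset's UTC timestamps (microsecond precision, 'Z' suffix)."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def split_by_cutoff(records, cutoff):
    """Return (calibration window, later records), split at `cutoff`."""
    past = [r for r in records if parse_ts(r[2]) < cutoff]
    future = [r for r in records if parse_ts(r[2]) >= cutoff]
    return past, future


# Hypothetical cutoff, chosen only so that the four sample rows are split.
cutoff = datetime(2019, 8, 15, 6, 0, tzinfo=timezone.utc)
calibration, to_score = split_by_cutoff(trades, cutoff)

# The (USER, ITEM) pairs that would be submitted to the calibrated RecSys.
pairs_to_score = [(subject, isin) for subject, isin, _ in to_score]
```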

2.2 Hyperparameter selection

In order to find a suitable setup for the anomaly detection tool we were looking for, we had to proceed on two quite different levels, both common in machine learning and usually called hyperparameter selection and model training.

Hyperparameter selection revolves around choosing, and often engineering, from the dataset at our disposal the features a RecSys model is built upon, namely the USERs, the ITEMs and the INTERACTIONs.

2.2.1 USER

The market player is by its very nature the preferred option for the USER, while several choices are available for the ITEMs, allowing one to set up different RecSys, each focusing on a different trait of the USER’s behaviour.

2.2.2 ITEM

A natural choice for the ITEM would be the ISIN, which is the unique code identifying a financial instrument. This would allow us to sort out securities and subjects based on their mutual interactions. An alternative choice for the ITEM could be the countervalue of the executed order, thus making the model aware of the typical volume range in which each USER operates. In this particular case, where the selected dimension is a continuous quantity, one has to use a bucketing procedure to convert it into a categorical attribute taking values in a finite, discrete list of items.
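One possible bucketing procedure, shown here only as an illustration, is to group countervalues by order of magnitude. The function name and the choice of base-10 buckets are assumptions, and the sketch ignores the fact that the sample values are expressed in different currencies.

```python
import math


def countervalue_bucket(countervalue):
    """Map a positive countervalue to an order-of-magnitude bucket label."""
    if countervalue <= 0:
        raise ValueError("countervalue must be positive")
    exponent = math.floor(math.log10(countervalue))
    return f"1e{exponent}-1e{exponent + 1}"


# Countervalues taken from the sample records above.
sample = [3068.70, 7500.00, 347.20, 106.24, 1101940.00]
buckets = [countervalue_bucket(value) for value in sample]
# buckets == ["1e3-1e4", "1e3-1e4", "1e2-1e3", "1e2-1e3", "1e6-1e7"]
```

Each label can then play the role of a categorical ITEM. Finer or coarser buckets would only require a different mapping inside the function.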

Elaborating further, one may construct new dimensions by combining the existing ones, aiming to discover patterns in subjects’ behaviour that correlate across different dimensions. For example, one may join the ISIN with some bucketing of the countervalue in order to explore subjects’ attitudes, where some kinds of securities are traded in low volumes and/or in large amounts, while other securities are traded in high volumes and/or in small amounts.
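A combined dimension of this kind could be sketched as follows, joining the ISIN with an order-of-magnitude bucket of the countervalue. The separator, the bucket definition and the function name are illustrative assumptions.

```python
import math


def composite_item(isin, countervalue):
    """Join the ISIN with an order-of-magnitude bucket of the countervalue."""
    magnitude = math.floor(math.log10(countervalue))
    return f"{isin}|1e{magnitude}"


# (SUBJECT, ISIN, countervalue) rows taken from the sample above.
trades = [
    ("1044976", "AU000000STM0", 7500.00),
    ("1021212", "FR0000124141", 347.20),
    ("1021212", "FR0000124141", 7334.60),
]
items = [composite_item(isin, cv) for _, isin, cv in trades]
# items == ["AU000000STM0|1e3", "FR0000124141|1e2", "FR0000124141|1e3"]
```

Note that the same ISIN now yields two distinct ITEMs for subject 1021212, one for each size range of its trades.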

2.2.3 INTERACTION

For each choice of the ITEM dimension, an appropriate metric needs to be chosen as the INTERACTION of the RecSys model. This can be a field or an engineered feature of the dataset that will be regarded as a sort of rating a given USER ascribes to a given ITEM. Here a non-categorical, quantitative dimension is favoured, so that a larger (smaller) value can be associated with a larger (smaller) rating. In our case a natural choice is the countervalue, since a larger volume of purchases clearly expresses a higher valuation of a financial instrument. In any case, even if a direct quantitative dimension is not available in the original data, a straightforward way to obtain a rating is to count the records of the original flat database after pivoting it through a group-by operation that uses the USER and ITEM dimensions as a two-key index. In fact, even if you are willing to use as a rating a quantitative dimension taken directly from the original database, you still have to aggregate its values after the group-by, since the RecSys needs a single rating score for each (USER, ITEM) pair.
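The group-by step can be sketched in plain Python as below, using the ISIN as the ITEM and building both a count-based and a sum-based rating. The variable names are hypothetical and the rows are a subset of the sample records.

```python
from collections import defaultdict

# (SUBJECT, ISIN, countervalue) rows taken from the sample above.
trades = [
    ("1039910", "AU000000ANO7", 3068.70),
    ("1039910", "AU000000ANO7", 2077.60),
    ("1044976", "AU000000STM0", 7500.00),
    ("1044976", "AU000000BOE4", 5600.00),
    ("1043883", "GB00B03MLX29", 753.00),
    ("1043883", "FR0000131104", 1905.12),
]

# One rating per (USER, ITEM) pair: number of deals and total countervalue.
count_rating = defaultdict(int)
sum_rating = defaultdict(float)
for user, item, countervalue in trades:
    count_rating[(user, item)] += 1
    sum_rating[(user, item)] += countervalue
```

The six flat records collapse into five (USER, ITEM) pairs, because the two trades of subject 1039910 on AU000000ANO7 are merged into a single entry.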

Actually, the specific aggregation function used to build the rating can be quite critical for the ability of the RecSys to capture the patterns in the users’ behaviour. For example, a simple aggregation function like count (or sum for a direct quantitative dimension) may result in a disproportionate characterization of the population, since a very few USERs make far more transactions (or exchange far more countervalue) than the others.

A naive improvement could be to normalize such a rating by dividing it by the total count (or total sum of the countervalue) of each USER. However, this could still be unsuitable: each of the few securities traded by a subject with very little activity would result in a high rating for her, while the many securities traded by a subject with a broad and scattered activity would each result in a very small rating for him.

It turned out that a better approach is to normalize the aggregation function by the max, instead of the total, over each USER. In this way, for example, if a sporadic subject traded an equal amount (either of countervalue or of number of deals) of a few securities, they will result in an equal rating for him, regardless of the total amount of activity. Similarly, if a very active subject traded many different securities, each in a similar amount, they will result for her in similar rating values that are not deflated by her own widespread activity.
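The difference between the two normalizations can be illustrated with a small sketch. The `normalize` helper and the deal counts are hypothetical, made up only to contrast a sporadic subject with a very active one.

```python
def normalize(ratings, by="max"):
    """Scale each USER's ratings by that USER's max (or total) rating."""
    per_user = {}
    for (user, _), value in ratings.items():
        per_user.setdefault(user, []).append(value)
    reducer = max if by == "max" else sum
    scale = {user: reducer(values) for user, values in per_user.items()}
    return {key: value / scale[key[0]] for key, value in ratings.items()}


# Hypothetical deal counts: sporadic subject "A" trades 2 securities twice each,
# very active subject "B" trades 20 securities 50 times each.
ratings = {("A", "X1"): 2, ("A", "X2"): 2}
ratings.update({("B", f"Y{i}"): 50 for i in range(20)})

by_total = normalize(ratings, by="total")
by_max = normalize(ratings, by="max")
```

With the total as denominator, each security of "A" gets a rating of 0.5 while each security of "B" gets only 0.05, even though "B" trades them far more often. With the max as denominator, every pair in this example gets a rating of 1.0.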