Una vita di Devianze

The Mirai malware & IoT security

Currently, I am working for a telecom company specialized in IoT devices. Essentially, we provide cellular communication options for IoT projects worldwide. I interact daily with customers deploying their devices in a variety of fields: vehicle tracking, energy monitoring and provisioning, security and industrial applications. While their applications are all different, I noticed that one thing never changes: the security threats to those devices. Regularly, every month, I have to deal with customers whose devices have been infected by malwares. While the damage to their operation is usually minimal, the data communication charges are not, leading to significant monetary consequences.

Given the prevalence of these situations, I thought I would dig deeper into the world of IoT security, to understand why these excess charges are happening. During this quest, I discovered the source of IoT malwares: Mirai, a botnet developed in 2016. Therefore, I used the opportunity of this paper to study Mirai, how it works and how we can defend from it.

Mirai Architecture

In the fall of 2016, multiple high-profile websites (e.g. Netflix, GitHub, Reddit) found themselves rendered inaccessible by a very powerful DDoS attack, estimated, by some accounts, to have reached 1.2 Tbps (The Economist, 2016), an unprecedented size.

After that attack, the source code and instructions of a new IoT malware named Mirai was release as open-source on a hacking forum. Since then, multiple derivative malwares have been created that are now bringing into spotlight the multiple security issues of the IoT world.

Mirai is a malware engineered to perform DDoS attacks (Distributed Denial of Service). The aim of those attacks is to overwhelm the target server by flooding it with superfluous requests, in order to interrupt its services and prevent legitimate clients to acces it (C. Douligeris, 2004). To achieve this aim, Mirai discovers, infects and controls unprotected (or loosely protected) IoT devices connected to the internet, to use them as bots and coordinating them to carry DDoS operations (Margolis, et al., 2017).

Its architecture is shown in Diagram 1. Initially, the Command-and-Control (C&C) servers runs two socket listeners: one on port 23 for Telent connections and one on port 101 for programmatic API. The C&C server is written in Go.

Depending on the type of data sent to the Telnet socket (either a 4-byte integer or something else), an Admin Handler or Bot Handler are created. The Admin handler is an interactive prompt that allows users to manage bots and attacks (all data is stored in a MySQL DB); on the contrary, a Bot Handler handles the connection of a new bot. If the former successfully communicates with the latter, a new bot is created and its ID added to the above database.

The API socket, instead, has only one function: listening for orders of attacks. The syntax of these orders is as follows:

-n <attack type> <1.1.1.1[/8][,] …>

<attack length> [flag=val …]

The -n controls the number of bots to use, the IP can be provided either as a single value or as a range, while the attack length defines the duration of the attack in seconds (from 0 to 3600). The attack type, chosen from 10 different types, defines the “flavor” of the malicious operation.

To infect a new device, each bot randomly selects and IP and then checks it against a table of forbidden addresses; to note, the IPs of the US Postal Service and of the Hewlett-Packard Corp. are expressly protected from the attacks. If the IP is valid, the bot will then launch a dictionary attack using a list of 60 factory default login credentials that can be seen in Table 1.

After a new device is found, its IP and login credentials are sent to the loader which will contact the device, download and load a binary of the malware. This new bot will then start to infect new devices, creating a cycle that allows Mirai to grow its networks of bot in an exponential fashion.

It is exactly this ability to easily infect devices that makes Mirai a formidable threat. While the types of attacks it can perform are nothing new, its worming capabilities are exceptionally dangerous. In addition, the source code of Mirai has been released as open source since 2017 (Gamblin, 2017); consequently, many derivatives malware have been developed since then.

Username	Password	Username	Password	Username	Password	Username	Password
666666	666666	root	7ujMko0admin	admin	smcadmin	root	realtek
888888	888888	root	7ujMko0vizxv	admin1	password	root	root
admin	(none)	root	888888	administrator	1234	root	system
admin	1111	root	admin	Administrator	admin	root	user
admin	1111111	root	anko	guest	12345	root	vizxv
admin	1234	root	default	guest	guest	root	xc3511
admin	12345	root	dreambox	mother	f****r	root	xmhdipc
admin	123456	root	hi3518	root	(none)	root	zlxx.
admin	54321	root	ikwb	root	0	root	Zte521
admin	7ujMko0admin	root	juantech	root	1111	service	service
admin	admin	root	jvbzd	root	1234	supervisor	supervisor
admin	admin1234	root	klv123	root	12345	support	support
admin	meinsm	root	klv1234	root	123456	tech	tech
admin	pass	root	pass	root	54321	ubnt	ubnt
admin	password	root	password	root	666666	user	user

Table 1: the factory default usernames & passwords used by Mirai

Defensive approaches

Mitigating the risks posed by Mirai is somewhat difficult in that it uses legitimate credentials to gain access to devices which are usually not scanned for malicious activity. Nevertheless, there are some important defensive actions that can significantly reduce the risk of infection:

Change device credentials: after the deployment of the device, the user must change the access credentials for SSH or Telnet access. It is important to note that the credentials that can be changed via the device GUI are not always the credentials used for SSH or Telent connections.
Closed unused ports: as described before, ports 22 and 23 are the ones exploited by Mirai to gain access. Hence, they should never be publicly accessible. Moreover, they should be blocked at router level to avoid any access to internal devices.
Monitor ioctl: Mirai (and its derivatives) have the common pattern of sending an ioctl request to the watchdog timer. This is done to prevent the watchdog to restart unexpectedly. Given that the watchdog is a fundamental fail-safe system for Linux IoT devices (Weingel, 2007), its disabling should never occur. Hence, monitoring its activity is a very important step to identify the presence of Mirai.
White-hat script: an automated white-hat penetration script can be used to identify vulnerable devices in the network. Once they are found, the script should attempt login in a manner similar to Mirai. This method could help in identifying exposed devices in very large network, where the risk of human error is greater.

Other security considerations for IoT

During my research, I also encountered many other challenges, other than malware, in securing an IoT project (Zhang, et al., 2014):

Identifying objects in the network: until now, most IoT application used the DNS system to identify (and name) objects in the network. However, such system is still vulnerable to cache poisoning & man-in-the-middle attacks, which can inject fake DNS records in the target cache.
Authentication & authorization: while many public-key (such as the ones studied during this course) provide a theoretically sound system from authentication & authorization, the absence of a global root Certificate Authority prevents those crypto-systems to be effective. In addition, it may prove impractical to provide a certificate to each IoT object given the sheer number of devices.
Cryptosystems & security protocols: public-key cryptosystems are highly desirable because they generally provide advanced security features. However, they are very often inoperable on IoT systems, given the resource constraints that these objects have.

In addition, (Zarca, et al., 2018) also point out that Software Defined Networks (SDN) could help in devising new defensive approaches. According to the authors, SDN have a number of defensive benefits:

Dynamic Flow Control: by decoupling the data from the network plane, there is now the possibility to enable dynamic access control functions depending on specified privileges and policies
Traffic Isolation: which allows to flexibly isolate compromised sections of the network
Network-wide visibility and monitoring: given that the SDN traffic is managed by a centralized controller, there is now the possibility of monitoring the traffic peak generated by compromised devices in the network.

Works Cited

The Economist, 2016. The internet of stings. [Online]
Available at: https://www.economist.com/science-and-technology/2016/10/08/the-internet-of-stings
[Accessed 3 February 2021].

C. Douligeris, A. M., 2004. DDoS attacks and defense mechanisms: classification and state-of-the-art.Computer Networks, 5 April, pp. 643-666.

Gamblin, J., 2017. GitHub Mirai-Source-Code. [Online]
Available at: https://github.com/jgamblin/Mirai-Source-Code
[Accessed 06 February 2021].

Weingel, C., 2007. The Linux Watchdog driver API. [Online]
Available at: https://www.kernel.org/doc/html/latest/watchdog/watchdog-api.html
[Accessed 6 February 2021].

Margolis, J. et al., 2017. An In-Depth Analysis of the Mirai Botnet. Altoona, PA,, International Conference on Software Security and Assurance (ICSSA).

Zhang, Z., Cho, M. C. Y., Wang, C. & Hsu, C., 2014. IoT Security: Ongoing Challenges and Research Opportunities. Matsue, Japan, IEEE 7th International Conference on Service-Oriented Computing and Applications.

Zarca, A. M., Bernabe, J. B., Farris, I. & Khettab, Y., 2018. Enhancing IoT security through network softwarizationand virtual security appliances. International Journal of Network Management, 28(5).

On data science and IoT

Since several days have been reflecting on the deep connection between data science (or what we refer to predictive modeling) and IoT. IoT is commonly defined as (Rouse, 2019):

a network of interconnected computing devices, mechanical actuators and sensors able to exchange data between themselves without the need of human interaction.

It’s clear to me that the connection between this new network of things and data science is striking; In fact, I strongly believe the real revolution will come when these two branches of technology will finally be recognized as deeply related. I imagine a future where the data collected from the sensors will be transformed into insight and information by the machine learning algorithms and will automatically trigger a response in the physical world thanks to the physical actuators always connected on the Internet.

Until now, data science has mostly focused on social network-generated data or Internet generated data (e.g. pictures, text mining on Twitter, etc); the insights that can be gathered from this kind of data is indeed limited in scope because no physical reaction can be triggered; or better, no improvement in efficiency can be triggered by using such data. On the contrary the data generated by the IoT world will pertain to the physical realm: think for example at the footfall in the city or the numbers of/the type of nutrient required by a crop field. All this data will be transmitted automatically and instantaneously over the Internet to algorithms able to predict and decide what to do based. This in turn will trigger a mechanical or chemical action inducing a response that is predetermined by humans using Machine Learning.

It is clear to me that the connection of the two technologies will be very important for humanity at large and it will be a multiplier of human capabilities in almost all fields of the physical realm.

Richieste simultanee con Python

Oggi ho dovuto testare un API, inviando richieste multiple con Python. Ovviamente, le richieste non devono essere sequenziali ma simultanee. Ossia, un semplice for loop non basta.

Ergo, ho trovato una soluzione usando molteplici processori. Utilizzare un sistema come Amazon Web Services EC2 o SageMaker fa ovviamente la differenza, perché permette di aumentare i cores a disposizione.

Ecco dunque un esempio della soluzione che ho scelto:

import requests
from concurrent.futures import ThreadPoolExecutor

def get_url(url):
return requests.get(url)

list_of_urls = ["https://postman-echo.com/get?foo1=bar1&foo2=bar2"]*10

with ThreadPoolExecutor(max_workers=10) as pool:
response_list = list(pool.map(get_url,list_of_urls))

for response in response_list:
print(response)

Selezioni ibride su un DataFrame pandas multi-indice

La selezioni di viste da un dataframe è cosa basilare per un Data Scientist. L’altro giorno al lavoro ho però incontrato una situazione spinosa:

selezionare dati da un dataframe multi-indice con maschera booleana e con specifico valore in indice.

La soluzione è il metodo pandas.DataFrame.query()

Vediamo un esempio pratico. Il nostro input è:

import pandas as pd
index = pd.MultiIndex.from_product([['xy1','xy2','xy3'], ['1','2','3','4','5']], names=['Tag', 'Page'])
df = pd.DataFrame([1,1,1,4,5,1,1,61,4,51,1,1,4,5,1], index, columns=['Value'])

mentre ciò che vogliamo ottenere è questo:

index2 = pd.MultiIndex.from_product([['1','2','3']], names=['Page'])
df2 = pd.DataFrame([1,1,1], index2, columns=['Value'])
df2

Usando pandas.DataFrame.query() possiamo risolvere il dilemma in molto molto pythoniano:

df1 = df.query("Tag == 'xy1' & Value == 1").reset_index(level=0, drop=True)

(il .reset_index() è utilizzato per resettare l’indice e ottenere il risultato cosmetico voluto)

Forecasting di vendite: serie storiche e ML

Predirre le vendite è il sogno più bagnato di ogni manager che si rispetti. Difatti, avere stime affidabili sull’andamento del mercato è la chiave del budgeting: più le stime sono corrette, più la programmazione diventa facile ed effettiva.

Come in ogni problema di predizione, la variabile che influisce di più sulla bontà del modello è l’informazione disponibile. Nella mia esperienza lavorativa, la quasi totalità delle aziende ha (per ora) una cultura del dato molto scarsa, se non nulla. Il data scientist si troverà dunque a dover lavorare con dati parziali, da reclamare tramite molteplici fonti, la cui affidabilità sarà semplicemente dubbia. Ovviamente, tale ostacolo è prettamente individuale e varierà da scenario a scenario: è dunque fuori dallo scopo di questo post.

Quello di cui voglio veramente parlare è come approcciare the algoritmic side of thigs . Ordunque, vi presento qui uno scoppiettante programmino per affrontare il forecasting delle serie storiche come dei veri pro.

Le basi

Il problema può essere affrontato in base a due variabili interdipendenti:

La finestra di predizione
Il lag delle variabili

La finestra di predizione rappresenta con quanto anticipo vogliamo predirre Y (e.g. se la predizione ha 3 mesi di anticipo, la finestra sarà di 3 mesi). Il lag delle variabili rappresenta invece la relazione temporale con cui le variabili indipendenti influenzano la dipendente. Ad esempio: le vendite di un’industria metallurgica saranno influenzate dall’andamento del PIL globale. Ma questa relazione non sarà immediata temporalmente: l’aumento di domanda del metallo sarà successivo di x mesi all’aumento del PIL.

Mentre la finestra di predizione è decisa dal Data Scientist e dal management, il lag delle variabile è fondamentalmente ignoto. Per trovarlo, in maniera ottimale, si ricorre alla Cross Validation. L’approccio che io propongo è univariato.

Restate con me, così vi spiego le funzioni per questo scopo.

L’oracolo algoritmico

Il programma è costituito da tre funzioni:

laggedDataMat: lagga il dataset (ossia “sposta” le variabili di un determinato lag)
bestLagRegr: trova il lag ottimale (univariato) per ogni variabile
modelSelection: prende l’outup di bestLagRegr ed esegue una grid search per trovare l’ottimale modello predittivo

Le librerie richieste per l’esecuzione sono:

import numpy as np, pandas as pd, pickle, copy
from sklearn.model_selection import cross_val_score, TimeSeriesSplit, GridSearchCV
from sklearn import ensemble
from sklearn.ensemble import GradientBoostingRegressor

Ecco qui di seguito le funzioni in Python:

laggedDataMat

def bestLagRegr(dataMat, maxLag, minLag, yName):

    # Purpose: identify the best lag for variables to be used in regression. Uses CrossValidation
    # (folds TimesSeriesSplit)

    # data - a panda DF with named columns
    # maxLag - the maximum lag to be tested. Integer
    # minLag - the minimum lag to be tested. Integer
    # yName - the name of the column of 'data' containing the dependent var. String

    data = dataMat.copy()
    colnames = [y for y in data.columns if y != yName]
    lags = range(minLag, maxLag)
    folds = TimeSeriesSplit(n_splits=int(np.round(data.shape[0] / 10, 0)))

    results = {}

    for col in colnames:

        scores = []
        lags_list = []

        for l in lags:
            varname = col + str('_lag') + str(l)
            data[varname] = data[col].shift(l)
            YX = data[[yName, varname]]
            YX = YX.dropna().as_matrix()

            # Build regressor and estimate metric of prediction performance with CV
            regr = ensemble.GradientBoostingRegressor(learning_rate=0.01, max_depth=1, n_estimators=500)
            perform = cross_val_score(regr, X=YX[:, 1].reshape(-1, 1), y=YX[:, 0], cv=folds,
                                      scoring='neg_median_absolute_error')

            # Store scores result and lags
            scores.append(np.median(perform))
            lags_list.append(l)

        # Calculate best score, corresponding best lag and store it in a dictionary, containing all colnames
        best_score = max(scores)
        best_lag = lags_list[scores.index(best_score)]
        results[col] = [best_lag, best_score]

    return(results)

bestLagRegr

def laggedDataMat(dataMat, yName, lagsDict):

    # Purpose: build a lagged DF
    # dataMat: the unlagged DF with named columns
    # yName: name of the dependent var. String
    # lagsDict: dictionary produced with 'bestLagRegr, containing:
    #                                                               - keys: column names of dataMat
    #                                                               - elements: lists with lag, CV score
    #
    # Output: a panda DataFrame with columns order sorted alphabetically

    # Initialize empty DF
    df = pd.DataFrame(index=dataMat.index)
    # Set dependent var
    df[yName] = dataMat[[yName]]

    # Creating and adding the lagged vars
    for colName in lagsDict.keys():
        l = lagsDict[colName][0]
        colNameLag = colName + str('_lag') + str(l)
        df[colNameLag] = dataMat[[colName]].shift(l)

    df = df.sort_index(axis=1)

    return(df)

modelSelection

def modelSelection(maxLag, data, depVar, toSave, pSpace = None, alpha = 0.95):

    if pSpace is None:
        pSpace = dict(n_estimators=list(range(5, 2000, 10)),
                      learning_rate=list(np.arange(0.001, 1, 0.1)),
                      max_depth=list(range(1, 3, 1)))

    lags = range(1, maxLag)
    results = dict()

    for lagMin in lags:
        print('Esimating model for lag: ', lagMin)
        lagAnalysis = bestLagRegr(data, maxLag, lagMin, depVar)
        lagMat = laggedDataMat(data, depVar, lagAnalysis)
        lagMat = lagMat.dropna()
        trainY = np.ravel(lagMat[depVar])
        lagMat = lagMat.drop([depVar], 1)

        folds = TimeSeriesSplit(n_splits=int(round(lagMat.shape[0] / 10, 0)))

        model = GradientBoostingRegressor(loss='ls')
        regr = GridSearchCV(estimator=model, param_grid=pSpace, scoring='neg_mean_squared_error', cv=folds)
        regr.fit(lagMat, trainY)

        modelName = toSave + '/' + 'modelOI' + '_lag' + str(lagMin) + '.sav'
        pickle.dump(regr.best_estimator_, open(modelName, 'wb'))
        temp = dict(BestModel=regr.best_estimator_, Score=regr.best_score_, Lags=lagAnalysis)

        regrQUpper = copy.deepcopy(regr.best_estimator_)
        regrQUpper.set_params(loss='quantile', alpha=alpha)
        regrQUpper.fit(lagMat, trainY)
        temp['QUpper'] = regrQUpper

        regrQLower = copy.deepcopy(regr.best_estimator_)
        regrQLower.set_params(loss='quantile', alpha=(1-alpha))
        regrQLower.fit(lagMat, trainY)
        temp['QLower'] = regrQLower

        key = 'Lag ' + str(lagMin)
        results[key] = temp

    return(results)

Calcolo stock minimi con R

Stimare lo stock minimo delle materie prime è una procedura difficile e suscettibile ad errori, soprattutto se eseguita con metodi euristici.

Per fortuna, la statistica ci salva ancora una volta, assicurandoci un livello ottimale di stock e massimizando i risparmi dell’inventario.

Il problema

Come si presenta il problema? Abbiamo 3 variabili da considerare:

Tempi di riordino (in questo esempio, 10 giorni di lead time)
Il consumo storico
Il Lean Manager che ti rincorre nel panico

Per l’esempio di questo post, il prodotto da stoccare saranno fogli di lamiera (misuarti in Kilogrammi), con un tempo di riordino di 10 giorni.

Exploratory Data analysis

*Fig. 1* Grafico consumo lamiera per la variabile temporale. La stagionalità è marcata

Un semplice grafico (Fig. 1) del consumo lamiera contro le decine dei giorni ci presenta una stagionalità abbastanza marcata: sarebbe dunque opportuno ottimizzare seguendo la variabile temporale, in modo da risparmiare sull’inventario.

*Fig. 2* Istogramma del consumo lamiera. L’approssimazione a distribuzione Gamma è chiaramente visibile.

Se eseguiamo un istogramma del consumo lamiera, la distribuzione è inoltre chiaramente approsimabile da una Gamma.
Siamo fortunati: possiamo già azzardare un modello.

Il modello

Come creare una stima per gli stock minimi? Un problema simile è facilmente risolvibile con il concetto di intervallo di confidenza (IC).

Possiamo quindi calcolare il consumo medio di lamiera data la variabile temporale, ossia la soluzione al problema risulta costruire un intervallo di confidenza (IC) intorno a:

$E[Y|(X,\omega)] = \mu_{Y|(X, \omega)}$

Dove:

Y è il consumo di metallo
X è la matrice delle variabili indipendenti (i.e. l’indice del periodo di 10 giorni)
ω è il codice prodotto (in questo caso solo un tipo di lamiera)
μ indica la media

Per stimare questa media, è necessario fittare un GLM con famiglia Gamma e link-function logaritmica:

$\Hat{y}_{i,k} = e^{\beta_{0} + \beta_{1}x_{i} + \beta_{2}\omega_{k}}$

Ricordo al caro lettore che a livello della link-function, la realtà è Normale. Dato ciò, costruire un IC α è banale: bisogna solamente eseguire il classico calcolo con la distribuzione t (perchè la dev std della popolazione non è nota, mascalzone!)

Il GLM (e il così costruito IC), ci stimano il consumo medio per periodo: dunque, il limite massimo di tale IC risulta essere la scelta migliore per lo stock minimo.

Ovviamente, la decisione del Lean Manager risulta ora concentrata sulla scelta del parametro α.

Ed ecco a voi il risultato:

Stock minimi lamiera: il livello minimo suggerito è rappresentato dalla linea rossa (95% di confidenza), mentre i punti verdi sono il consumo medio per periodo.

Per una trattazione più matematicamente rigorosa del problema, potete riferirvi al mio GitHub, dove troverete la documentazione del progetto. Qui in calce, invece, troverete il codice usato.

set.seed(123)


# Prototyping for Minimum stock of metal sheets

lamiera <- read.csv2("foo/Lamiera.csv", sep= ";", dec = ",")
lamiera <- lamiera[lamiera$Year >= '2014',] # Only considering years 2014 onwards. By EDA boxplot(quantity~Year), 2014 is the first year of greater variability

# Fitting model: glm, Gamma distr with log-link function
lmTest <- glm(Quantity~as.factor(TenDaysOfYear)+as.factor(PartNo), data = lamiera, family= Gamma(link= 'log'))


# Estimating CI and plotting

alpha <- 0.98 # Significance level

codiciLamiera <- unique(lamiera$PartNo)
for(codLamiera in codiciLamiera){
  
  
  preddata <- data.frame('TenDaysOfYear' = seq(1,36, by=1), 'PartNo' = codLamiera)
  preds <- predict(lmTest, newdata = preddata, type = "link", se.fit = TRUE)
  
  critval <- qt(alpha, nrow(lamiera)-1)
  upr <- preds$fit + (critval * preds$se.fit)
  lwr <- preds$fit - (critval * preds$se.fit)
  fit <- preds$fit
  fit2 <- lmTest$family$linkinv(fit)
  upr2 <- lmTest$family$linkinv(upr)
  lwr2 <- lmTest$family$linkinv(lwr)
  
  name <- paste('foo/Lamiera.csv','lamiera_', codLamiera,'.png', sep = '')
  
  png(filename = name)
  plot(Quantity~TenDaysOfYear, data= lamiera[lamiera$PartNo == codLamiera,], main=codLamiera)
  lines(upr2, col="red")
  points(fit2, col="green")
  dev.off()
  
  name <- paste('foo/Lamiera.csv','lamiera_', codLamiera,'.csv', sep = '')
  preddata$PredictedConsumption <- round(fit2, 2)
  preddata$MinimumStock <- round(upr2, 2)
  write.csv2(preddata, file = name, row.names = F)
}



# Model checks: all assumptions are well respected
plot(rstandard(lmTest), ylab = "Standardized residuals") # Super-nice looking plot: homoschedasticity respected
resOrdered <- rstandard(lmTest)[order(lamiera$TenDaysOfYear)]
plot(1:nrow(lamiera), resOrdered) # Independence respected
hist(rstandard(lmTest), main="Distribution of standardized residuals", xlab="Standardized residuals")
qqplot(rnorm(length(rstandard(lmTest))) ,rstandard(lmTest), xlab = "Std Normal sample", ylab = "Standardized residuals")
abline(c(0,1), col= "red")# Normality respected
plot(fitted(lmTest), residuals(lmTest, type = "pearson"), xlab = 'Fitted values', ylab = 'Pearson residuals') # mild pattern of pearson residuals, but can be overlooked due to big sample size
# Linearity: respected by construction (i.e. scale of link is always linear)

Business Analytics con retaileR (parte II)

Ecco la seconda parte dell’articolo sul nuovo pacchetto R retaileR. La prima parte potete trovarla qui.

Le Funzioni

closure.opp.cost

closure.opp.cost serve per stimare il costo di opportunità in funzione del tempo di chiusura dei locali di retail. Ossia, una stima del valore monetario perso a seconda dell’orario di chiusura.

I suoi argomenti sono:

Utilizzando il dataset preformattato, ecco un esempio:

closure_opp_cost("16:15:00", X= sales_august)

e l’output:

$Mean
[1] 10.35812

$Median
[1] 10.35

$StDev
[1] 8.392746

$`Late Sales`
[1] 10.30 18.90 6.30 26.70 14.60 10.60 0.05 24.18 10.40 9.60 14.10 14.90 5.10 0.00 0.00 0.00

In questo caso, chiudere il negozio alle 16:15:00 comporterebbe una perdita media di 10.35 €, con una varianza di 8.39 €. Inoltre, dal vettore Late.sales si può vedere chiaramente come 3 giorni non ci siano state vendite dopo le 16:15:00.

segment_prod_line

Questa funzione è stata creata per l’analisi delle diverse famiglie di prodotti secondo un determinato segmento temporale.

Ossia, se la variabile temporale del dataset input è giornaliera, segment_prod_line produrrà le vendite medie giornaliere per la product line di interesse.

Vediamone il funzionamento:

specifichiamo la linea di prodotti come un vettore di stringhe:

prods &lt;- c("Americano", "Espresso")

e successivamente evochiamo segment_prod_line

segment_prod_line(sales_august, prods)

ecco l’output:

[1] 31.56065

ciò significa che nei giorni che il dataset copre, abbiamo venduto in media 31.5 Americani ed Espressi al giorno.

Mean.items.times

Per ultima, una funzione di analisi temporale delle vendite. mean.items.times calcola le vendite medie (in volume) per ogni ora. Ossia crea un istogramma con l’unità temporale sull’asse delle ascisse e le vendite medie sull’asse delle ordinate.

Ecco il funzionamento (che ricordo, è ottimizzato per un dataset iZettle):

 mean.items.times(sales_august)

e l’output corrispondente:

       Time  Quantity
1  05:00:00  1.000000
2  06:00:00  9.000000
3  07:00:00 23.100000
4  08:00:00 30.695652
5  09:00:00 24.826087
6  10:00:00 17.695652
7  11:00:00 16.000000
8  12:00:00 11.173913
9  13:00:00 10.608696
10 14:00:00  8.590909
11 15:00:00  7.714286
12 16:00:00  6.000000
13 17:00:00  2.000000
14 18:00:00  1.000000

‘Un contributo alla statistica’ di Wislawa Szymborska

No comment: semplicemente una poesia trovata recentemente. A base di statistica ovviamente. Buona lettura.

Un contributo alla statistica di Wislawa Szymborska

Su cento persone:

che ne sanno sempre piu’ degli altri
– cinquantadue;

insicuri a ogni passo
– quasi tutti gli altri;

pronti ad aiutare,
purche’ la cosa non duri molto
– ben quarantanove;

buoni sempre,
perche’ non sanno fare altrimenti
– quattro, be’, forse cinque;

propensi ad ammirare senza invidia
– diciotto;

viventi con la continua paura
di qualcuno o qualcosa
– settantasette;

dotati per la felicita’
– al massimo poco piu’ di venti;

innocui singolarmente,
che imbarbariscono nella folla
– di sicuro piu’ della meta’;

crudeli,
se costretti dalle circostanze
– e’ meglio non saperlo
neppure approssimativamente;

quelli col senno di poi
– non molti di piu’
di quelli col senno di prima;

che dalla vita prendono solo cose
– quaranta,
anche se vorrei sbagliarmi;

ripiegati, dolenti
e senza torcia nel buio
– ottantatre’
prima o poi;

degni di compassione
– novantanove;

mortali
– cento su cento.
Numero al momento invariato.

Business analytics con retaileR (parte I)

Perdonate la lunga assenza: sono stato molto impegnato in questi ultimi mesi. Tuttavia, per farmi perdonare, ho sviluppato una chicca tutta per voi:

un nuovissimo pacchetto R.

L’idea di creare un pacchetto R mi ha sempre affascinato. In fondo, è proprio grazie ai contributi Open Source che questo linguaggio continua ad arricchirsi. Tuttavia, mi è sempre mancato un ambito di applicazione, ossia non ho mai immaginato per cosa costruire il pacchetto. Per fortuna, negli ultimi mesi ho partecipato ad un progetto che mi ha introdotto al mondo del retail (il commercio Business to Consumer per intenderci) ed ai suoi peculiari problemi: da qui è nata l’idea di costruire una serie di funzioni che aiutino l’analisi dei dati in queso settore.

Il pacchetto, sviluppato in R, è chiamato retaileR e potete trovarlo sul mio Github. Nei prossimi mesi cercherò di renderlo disponibile sul CRAN, la repository ufficiale di tutti i pacchetti R.

Che cosa può fare retaileR? Retailer è nato con l’idea di creare un gruppo di funzioni utili per l’analisi dei dati di vendita. In questo modo, il processo decisionale verrebbe reso più semplice e con stime più efficaci. retaileR può aiutare nel decidere l’orario di chiusura di una filiale, organizzare lo staffing individuando gli orari di massimo/minimo volume e molte altre funzioni simili.

In questa serie di articoli illustrerò le funzioni e provvederò degli esempi su come utilizzare questo nuovo pacchetto.

I Dati

Il dataset di base viene considerato quello creato in automatico da iZettle, un sistema POS per il commercio al dettaglio. Qui sotto le prime righe di un dataset illustrativo (che userò negli esempi successivi) e che potrete trovare sempre in GitHub.

Carichiamo il dataset delle vendite fittizie relative ad un mese e visualizziamone le prime righe.

sales_august &amp;lt;- read.csv(“foo/sales_august.csv”)
head(sales_august)

Ecco l’output:

        Date                Time Receipt.number       Name    Variant Unit Quantity Price..GBP. Discount..GBP. Final.price..GBP.
1 2017-08-01 2017-08-20 06:19:00           7359  Americano              NA        1         2.5           0.00              2.50
2 2017-08-01 2017-08-20 06:19:00           7359   Espresso              NA        1         2.0           0.00              2.00
3 2017-08-01 2017-08-20 06:28:00           7360      Mocha              NA        1         3.1           1.55              1.55
4 2017-08-01 2017-08-20 06:43:00           7361 Flat White              NA        2         5.4           0.00              5.40
5 2017-08-01 2017-08-20 06:52:00           7362  Macchiato              NA        1         2.2           0.00              2.20
6 2017-08-01 2017-08-20 06:52:00           7362 Extra Shot Extra shot   NA        1         0.2           0.00              0.20

Le Funzioni

sales.format

questa funzione si propone come l’inizio di ogni analisi con retaileR poiché la sua funzione è formattare correttamente i dati da analizzare. In particolare, essa agisce sulle variabili di tempo (i.e. data e ora) formattandole in maniera uniforme.

Ecco un esempio:

sales_august <- sales.format(sales_sept)
[\code]

I suoi argomenti sono: X, date_var= “Date”, time_var= “Time”, format_time = ‘%H:%M’, dove:

X: indica il data frame contenente i dati da analizzare.
date_var è il nome, come stringa, della variabile contenente la data. Default a “Date”.
time_var idem come date_var ma per il tempo (ossia l’ora). Default a “Time”.
format_time come deve essere formattato il tempo. Per usare retaileR è necessario il formato ‘ora : minuto’ indicato appunto dal default ‘%H:%M’.

as.sales

l’analisi dei dati di vendita crea particolari problemi a livello della variabile temporale. Infatti, essendo una serie temporale, le vendite possono essere raggruppate seguendo tempi diversi: ore, giorni, settimane etc.

Inoltre, non è semplice capire se il di crescita sia positivo o negativo: la stagionalità e i movimenti casuali rendono molto spesso difficile questo compito.

Per evitare questi due problemi, la funzione as.sales crea una nuova classe di oggetti (in S4) che include una semplice regressione lineare e diversi raggruppamenti temporali.

i suoi argomenti sono:

X: indica il data frame contenente i dati da analizzare.
date_var è il nome, come stringa, della variabile contenente la data. Default a “Date”.
time_var idem come date_var ma per il tempo (ossia l’ora). Default a “Time”.
sales_var è il nome, come stringa, della variabile contenete il valore monetario di ogni singola vendita. Defaulta a “Final.price..GBP.”.
format_time come deve essere formattato il tempo. Per usare retaileR è necessario il formato ‘ora : minuto’ indicato appunto dal default ‘%H:%M’.
discount_var, nome colonna contenente gli sconti. Il default è “Discount..GBP.”.

Come output, la funzione produce un oggetto S4 sales, la cui struttura è la seguente:


ecco l'output:

 with 5 slots
  ..@ Daily           :'data.frame':	16 obs. of  2 variables:
  .. ..$ Date             : Date[1:16], format: "2017-08-01" "2017-08-02" "2017-08-03" "2017-08-04" ...
  .. ..$ Final.price..GBP.: num [1:16] 287 268 280 342 337 ...
  ..@ Weekly          :'data.frame':	3 obs. of  2 variables:
  .. ..$ Week             : num [1:3] 31 32 33
  .. ..$ Final.price..GBP.: num [1:3] 1514 2757 1565
  ..@ Monthly         :'data.frame':	1 obs. of  2 variables:
  .. ..$ Month            : num 8
  .. ..$ Final.price..GBP.: num 5836
  ..@ Discount        : num 39.8
  ..@ sales_funct_time:List of 12
  .. ..$ coefficients : Named num [1:2] -201711.1 11.6
  .. .. ..- attr(*, "names")= chr [1:2] "(Intercept)" "Date"
  .. ..$ residuals    : Named num [1:16] 9.79 -21.28 -20.64 29.43 13.04 ...
  .. .. ..- attr(*, "names")= chr [1:16] "1" "2" "3" "4" ...
  .. ..$ effects      : Named num [1:16] -1459 214.3 -18.6 30.4 13.1 ...
  .. .. ..- attr(*, "names")= chr [1:16] "(Intercept)" "Date" "" "" ...
  .. ..$ rank         : int 2
  .. ..$ fitted.values: Named num [1:16] 278 289 301 312 324 ...
  .. .. ..- attr(*, "names")= chr [1:16] "1" "2" "3" "4" ...
  .. ..$ assign       : int [1:2] 0 1
  .. ..$ qr           :List of 5
  .. .. ..$ qr   : num [1:16, 1:2] -4 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 ...
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:16] "1" "2" "3" "4" ...
  .. .. .. .. ..$ : chr [1:2] "(Intercept)" "Date"
  .. .. .. ..- attr(*, "assign")= int [1:2] 0 1
  .. .. ..$ qraux: num [1:2] 1.25 1.27
  .. .. ..$ pivot: int [1:2] 1 2
  .. .. ..$ tol  : num 1e-07
  .. .. ..$ rank : int 2
  .. .. ..- attr(*, "class")= chr "qr"
  .. ..$ df.residual  : int 14
  .. ..$ xlevels      : Named list()
  .. ..$ call         : language lm(formula = as.formula(paste(sales_var, "~", date_var)), data = day)
  .. ..$ terms        :Classes 'terms', 'formula'  language Final.price..GBP. ~ Date
  .. .. .. ..- attr(*, "variables")= language list(Final.price..GBP., Date)
  .. .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. .. ..$ : chr [1:2] "Final.price..GBP." "Date"
  .. .. .. .. .. ..$ : chr "Date"
  .. .. .. ..- attr(*, "term.labels")= chr "Date"
  .. .. .. ..- attr(*, "order")= int 1
  .. .. .. ..- attr(*, "intercept")= int 1
  .. .. .. ..- attr(*, "response")= int 1
  .. .. .. ..- attr(*, ".Environment")=<environment: 0x102165000> 
  .. .. .. ..- attr(*, "predvars")= language list(Final.price..GBP., Date)
  .. .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "other"
  .. .. .. .. ..- attr(*, "names")= chr [1:2] "Final.price..GBP." "Date"
  .. ..$ model        :'data.frame':	16 obs. of  2 variables:
  .. .. ..$ Final.price..GBP.: num [1:16] 287 268 280 342 337 ...
  .. .. ..$ Date             : Date[1:16], format: "2017-08-01" "2017-08-02" "2017-08-03" "2017-08-04" ...
  .. .. ..- attr(*, "terms")=Classes 'terms', 'formula'  language Final.price..GBP. ~ Date
  .. .. .. .. ..- attr(*, "variables")= language list(Final.price..GBP., Date)
  .. .. .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. .. .. ..$ : chr [1:2] "Final.price..GBP." "Date"
  .. .. .. .. .. .. ..$ : chr "Date"
  .. .. .. .. ..- attr(*, "term.labels")= chr "Date"
  .. .. .. .. ..- attr(*, "order")= int 1
  .. .. .. .. ..- attr(*, "intercept")= int 1
  .. .. .. .. ..- attr(*, "response")= int 1
  .. .. .. .. ..- attr(*, ".Environment")=<environment: 0x102165000> 
  .. .. .. .. ..- attr(*, "predvars")= language list(Final.price..GBP., Date)
  .. .. .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "other"
  .. .. .. .. .. ..- attr(*, "names")= chr [1:2] "Final.price..GBP." "Date"
  .. ..- attr(*, "class")= chr "lm"

Come si può osservare, l’oggetto creato da as.sales contiene 5 sub-oggetti:

3 raggruppamenti di tempo: daily, weekly, monthly.
Un valore dei discount totali nel dataset.
Una oggetto lm di regressione lineare, utile per comprendere il trend.

Le Macchine possono vederti

Sulla costruzione di algoritmo di riconoscimento visivo

Avrete certamente notate la mia più che settimanale assenza. Purtroppo sono stato impegnato per motivi lavorativi: per farmi perdonare, ho deciso di produrre un bell’articolo (con tanto di programma Python) su un argomento iper attuale: la Machine Vision.

Intro

La Machine Vision raccoglie tutte le tecniche algoritmiche per il riconoscimento automatico di immagini. In essenza, utilizzando queste tecniche, è possibile insegnare ad una macchina a riconoscere gli oggetti più disparati (anche in tempo reale): l’applicazione più semplice é ad esempio la lettura automatica di testi. Come quando vi arriva la multa a casa, perché siete entrati come dei fessi nella ZTL: la telecamera, partendo da una foto, ha automaticamente letto la targa del vostro carro.

La teoria

A prima vista potrebbe sembrare che algoritmi di tale portata possano essere complessi e sostanzialmente distaccati dalla teoria di Data Science. In realtà, l’algebra lineare è ancora una volta onnipresente e viene in aiuto. Difatti, i metodi di Machine Vision non differiscono di molto dai metodi di Data Science in generale. Le differenze più marcate sono in come i dati (i.e. le immagini) vengono pre-processate, prima di essere inviate ad un classificatore (e.g. una Random Forest o una Rete Neurale).

Ordunque, qual’è la logica generale da seguire? Le immagini presentano un inevitabile problema di fondo: sono “pesanti” dal punto di vista di utilizzo memoria. Infatti, anche una piccola foto in b\n di dimensioni (in pixel) 100×100 rappresenta, per un calcolatore, una matrice di 10.000 elementi. Triplicate il valore per una in RGB.

Dunque, a meno di avere a disposizione una elevata potenza di calcolo, il classificatore impiegherà un tempo non trascurabile per stimare tutti i parametri.

Per ovviare a questo problema, non si deve nutrire il modello con le immagini pure, ma bisognerà estrarre delle misure (in inglese features) che riassumano l’informazione contenuta in esse. Di queste misure ne esistono di tutti i gusti ed è sostanzialmente impossibile riassumerle tutte. Come esempio, si prendano quelle utilizzate nel programma descritto più avanti nell’articolo.

Le features diventano dunque le nostre nuove variabili. In linea generale, il nostro problema viene nuovamente ridotto al generale:

$Y=f(X)$

dove X è un vettore di lunghezza p contenente i valori per le p misure e $Y= \in \{0,1, . . ., k\}$ rappresentante le k classi di immagini.

Un semplice esempio per chiarire: costruiamo un algoritmo per permettere al computer di riconoscere il colore blu dal rosso e viceversa, utilizzando immagini di 10×10 pixels. In questo caso avremo n immagini, alcune completamente rosse e altre blu. Per n abbiamo scelto 120.

Innanzitutto, importiamo le librerie che ci serviranno:

from PIL import Image
import numpy as np
import os
from os import walk
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn import cross_validation

Ora doppiamo si procede con la creazione delle cartelle e delle immagini. Al tempo stesso si deve anche creare il vettore della variabile dipendente (indicato con Y).


# -------------------------------------------------------------------------------------- #
# Controlla l'esistenza di due cartelle per salvare le immagini. Se non presenti, le crea
directories = ["/foo/rosso/", "/foo/blu"]
for dir in directories:
    if not os.path.exists(dir):
        os.makedirs(dir)
# -------------------------------------------------------------------------------------- #

files = []
t = 0  # contatore
size = 10 # dimensione k, in pixel, di immagini kxk
iterations = range(0, 60) # costruisce 60 immagini
Y = []

for u in iterations:

    k1 = int(np.random.uniform(0, 100)) # Valori aleatori per introdurre variabilità nei colori
    k2 = int(np.random.uniform(0, 100))
    k3 = int(np.random.uniform(0, 100))
    k4 = int(np.random.uniform(0, 100))

    col = [["rosso/", (255, k1, k2)], ["blu/", (k3, k4, 255)]] # paletta colori

    for n,i in col:
        im = Image.new("RGB", (size, size))
        pix = im.load()

        for x in range(size):
            for y in range(size):
                pix[x, y] = i # setta il colore dei pixel
                
        filename = directory + n + str(t) + ".png"
        im.save(filename, "PNG")
        files.append(filename) # crea la lista dei filenames
        Y.append(n[0:-1]) # Crea il vettore variabile dipendente con le classi
        t += 1

Una volta creato il dataset e la variabile dipendente, si può procedere all’estrazione delle misure, che andranno a creare la nostra matrice X delle variabili indipendenti. Come già accennato, la scelta di quali features includere è notevolmente specifica al problema da risolvere. In questo semplice esempio, la distribuzione dei colori dei pixel sui 3 canali (RGB) è sufficiente.

Nota tecnica: nell’ultimo passaggio logico, se in R, sarebbe meglio vettorizzare la funzione anziché evocarla in loop. Ma usando Py, il for non crea troppi problemi.

print("Extracting features")
def extract_features(path):

    ''' Extract features from image of MetroBank Clapham Junction: colours distributions, sdev of the colours, number
    of objects and several measurements on those objects.
    Input: path - string with path of the picture
    Output: a vector '''


    # Resizing to 1000x1000, pixel distribution for each colour channel according to a fixed number of bins
    im = Image.open(path)
    im_arr = np.array(im)

    hist_R = np.histogram(im_arr[:, :, 0].flatten()) # Distibuzione Rosso
    hist_G = np.histogram(im_arr[:, :, 1].flatten()) # Distribuzione Verde
    hist_B = np.histogram(im_arr[:, :, 2].flatten()) # Distribuzione Blu

    features = np.concatenate((hist_R[0], hist_G[0], hist_B[0], hist_R[1], hist_G[1], hist_B[1]), axis=0)

    features = np.reshape(features, (1, len(features)))
    return (features)

# ------------------------------------------------------------------------------- #
# Evoca la funzione 'extract_features()' in loop. In R sarebbe meglio vettorizzare.

X = np.zeros((len(files), 63)) # ncol da modificare a seconda del'entrata
i = 0
for im in files:
    feat = extract_features(im)
    X[i, 0:len(feat[0])] = feat[0]
    i += 1
    print(i/len(files)*100, '%', ' done')

E alla fine, basta costruire un classificatore: in questo caso, data la semplicità dell’esempio, ho scelto una normale Random Forest. Facile da utilizzare avendo pochi parametri, non crea troppi problemi di overfitting e funziona straight out of the box.

# -------------------------------------------------------------------------------------------- #

# Stima del modello: Random Forest

print("Training")
# n_estimators is the number of decision trees
# max_features also known as m_try is set to the default value of the square root of the number of features
clf = RF(n_estimators=100, n_jobs=3)
scores = cross_validation.cross_val_score(clf, X, Y, cv=5, n_jobs=1) # Effettua la Cross Validation con 5 folds
print("Accuracy of all classes")
print(np.mean(scores)) # Stampa la media del punteggio di Cross Validation (accuracy in questo caso)

Se eseguite il programma, vedrete che l’accuracy media è di 1: abbiamo dunque costruito un modello perfetto per distinguere i colori di semplici immagini. L’accuracy di 1 tuttavia non è sempre un buon segno: potrebbe voler dire che il nostro modello è in over-fitting, ossia è troppo adattato ai dati osservati ma non è generalizzabile. In questo caso però non rischiamo di cadere in questo problema poiché il training set é molto semplice e praticamente privo di rumore.

Enjoy!