Alberto Ferrando

Helioscope tutorial

The following are short notes describing the process to design a commercial PV installation using Helioscope.

Step 1: new project

Enter the project name and select the location by address or coordinates. Select the type of installation (commercial in this case; the other options apply different default settings) and enter a brief description if needed.

Step 2: create new design

Create a new design by selecting the button highlighted in red and write a name in the dialog box that appears. If you want, you can copy an existing design and modify it, or otherwise start from scratch.

Step 3: select source for geospatial data

The source can be selected in the top right corner; the options are Google, Bing, Google Street and Nearmap, the last of which is only available by subscription. Ideally, when working with building roofs, we want a perfectly perpendicular image, i.e. an image of the roof that does not show the walls. For example, the red area in the image from Google shows walls, while the Bing image is much closer to perpendicular. We will therefore proceed with the image from Bing.

Step 3: draw the field

We now have to draw the plane(s) on which to design our array, so that we have some boundaries. We do this by clicking Mechanical on the left side and then New; the pointer turns green and lets us draw the contours of the roof. One click places a vertex, and clicking while holding Shift constrains the angle to 90 degrees. Double-click the last vertex to finish the plane.

Step 4: select layout rules

Subsequently, we have to select the layout rules. These are:

Racking:

In this case, we are going to select a fixed tilt racking.

Fixed tilt racking: the tilt of the panels does not vary with time; it is fixed at installation.
Flush mount racking: ideal for pitched roofs; panels are mounted flush with the roofing, following its angle with the ground. Reduces lift force from wind.
East-west racking: solar panels are installed in pairs, longitudinally, facing east and west at a pitch angle. This is ideal for locations close to the equator and reduces lift force from wind.
Carport: maximizes shading underneath the panels; used, as the name suggests, as parking covers.

Module height needs to take into account the height of the building on which the panels are mounted. In this case, lacking real measurements, we will assume a height of 165 ft.

The azimuth angle is defined as the angle between the direction the panels are facing and geographic south. The ideal angle can be found through calculations that maximize irradiance on a yearly basis. In this case, we simply align the panels with the south wall.

The tilt angle indicates the angle between the panels and the horizontal. As a rule of thumb, the tilt should be equal to the latitude of the location. In this case, we select 30 degrees.

The specific product selected will mostly depend on the energy demand of the application and the specific stock of the installer. In this case, we will select the standard TSM-PD14 320W.

The automatic layout rules allow the user to select different design characteristics, which are very much project-dependent. In this case, we will simply accept the standard values, given that we do not have real-life project constraints to work with. However, we will introduce a 6 ft setback for safety. The setback is the distance between the rows of panels and the edges of the plane.

Step 5: select keepouts

Keepouts are areas where panels will not be installed, either due to physical obstructions or for maintenance purposes. In our case, LIDAR imagery shows three large heat-pump or air-filtering units. Hence, we can place a keepout area around them with a height of about 10 ft.

Raised keepouts also generate shade. To analyze which modules will be affected by the shade, we can click on Advanced, Shading and Calculate Shading. We can then set a threshold for the maximum shading accepted (in this case 20%), so that we can remove the most affected panels.
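The thresholding step can be pictured with a small Python sketch. This is not Helioscope's own logic or API: the module IDs, the shading fractions and the function name are made up for illustration, and the only thing taken from the text is the 20% threshold.

```python
def modules_to_remove(shading_by_module, max_shading=0.20):
    """Return the IDs of modules whose shading fraction exceeds the threshold."""
    return sorted(
        module_id
        for module_id, shading in shading_by_module.items()
        if shading > max_shading
    )

# Hypothetical shading fractions (0.0 = never shaded, 1.0 = always shaded).
shading = {"M1": 0.05, "M2": 0.27, "M3": 0.20, "M4": 0.41}

removed = modules_to_remove(shading)
print(removed)
```

A module sitting exactly on the threshold (M3 here) is kept, because the comparison is strict.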

Step 6: electrical

Next, the electrical tool allows you to select inverters, wiring and DC subsystems (i.e. recombiners that consolidate power from the modules into one main wire that is then fed to the inverter). In our case, we moved the inverters to the edges to facilitate maintenance and left an AC/DC ratio of 1.20, to avoid clipping losses as explained on this page. Different wiring can be selected to reduce the losses.

We also added an AC home run to connect all the inverters. Further technical help on the electrical design can be found here.

Step 7: simulation

Condition sets can be selected; they define assumptions such as weather data and loss coefficients. In this case, we will keep the standard values. The simulation report looks like the figure below.

The Mirai malware & IoT security

Currently, I am working for a telecom company specialized in IoT devices. Essentially, we provide cellular communication options for IoT projects worldwide. I interact daily with customers deploying their devices in a variety of fields: vehicle tracking, energy monitoring and provisioning, security and industrial applications. While their applications are all different, I noticed that one thing never changes: the security threats to those devices. Every month, I have to deal with customers whose devices have been infected by malware. While the damage to their operations is usually minimal, the data communication charges are not, leading to significant monetary consequences.

Given the prevalence of these situations, I thought I would dig deeper into the world of IoT security, to understand why these excess charges are happening. During this quest, I discovered the origin of much IoT malware: Mirai, a botnet that emerged in 2016. Therefore, I used the opportunity of this paper to study Mirai, how it works and how we can defend against it.

Mirai Architecture

In the fall of 2016, multiple high-profile websites (e.g. Netflix, GitHub, Reddit) found themselves rendered inaccessible by a very powerful DDoS attack, estimated, by some accounts, to have reached 1.2 Tbps (The Economist, 2016), an unprecedented size.

After that attack, the source code and instructions of a new IoT malware named Mirai were released as open source on a hacking forum. Since then, multiple derivative malware strains have been created, and they are now bringing the many security issues of the IoT world into the spotlight.

Mirai is a piece of malware engineered to perform DDoS (Distributed Denial of Service) attacks. The aim of those attacks is to overwhelm the target server by flooding it with superfluous requests, in order to interrupt its services and prevent legitimate clients from accessing it (C. Douligeris, 2004). To achieve this aim, Mirai discovers, infects and controls unprotected (or loosely protected) IoT devices connected to the internet, using them as bots and coordinating them to carry out DDoS operations (Margolis, et al., 2017).

Its architecture is shown in Diagram 1. Initially, the Command-and-Control (C&C) server runs two socket listeners: one on port 23 for Telnet connections and one on port 101 for the programmatic API. The C&C server is written in Go.

Depending on the type of data sent to the Telnet socket (either a 4-byte integer or something else), an Admin Handler or a Bot Handler is created. The Admin Handler is an interactive prompt that allows users to manage bots and attacks (all data is stored in a MySQL DB); a Bot Handler, by contrast, handles the connection of a new bot. If the former successfully communicates with the latter, a new bot is created and its ID is added to the above database.

The API socket, instead, has only one function: listening for orders of attacks. The syntax of these orders is as follows:

-n <attack type> <1.1.1.1[/8][,] …> <attack length> [flag=val …]

The -n controls the number of bots to use, the IP can be provided either as a single value or as a range, while the attack length defines the duration of the attack in seconds (from 0 to 3600). The attack type, chosen from 10 different types, defines the “flavor” of the malicious operation.

To infect a new device, each bot randomly selects an IP address and checks it against a table of forbidden addresses; notably, the IPs of the US Postal Service and of the Hewlett-Packard Corp. are expressly excluded from the attacks. If the IP is valid, the bot then launches a dictionary attack using a list of 60 factory default login credentials, shown in Table 1.

After a new device is found, its IP and login credentials are sent to the loader, which contacts the device, then downloads and loads a binary of the malware. This new bot then starts to infect new devices, creating a cycle that allows Mirai to grow its network of bots exponentially.

It is exactly this ability to easily infect devices that makes Mirai a formidable threat. While the types of attacks it can perform are nothing new, its worming capabilities are exceptionally dangerous. In addition, the source code of Mirai is publicly available (Gamblin, 2017); consequently, many derivative malware strains have been developed since its release.

Username	Password	Username	Password	Username	Password	Username	Password
666666	666666	root	7ujMko0admin	admin	smcadmin	root	realtek
888888	888888	root	7ujMko0vizxv	admin1	password	root	root
admin	(none)	root	888888	administrator	1234	root	system
admin	1111	root	admin	Administrator	admin	root	user
admin	1111111	root	anko	guest	12345	root	vizxv
admin	1234	root	default	guest	guest	root	xc3511
admin	12345	root	dreambox	mother	f****r	root	xmhdipc
admin	123456	root	hi3518	root	(none)	root	zlxx.
admin	54321	root	ikwb	root	0	root	Zte521
admin	7ujMko0admin	root	juantech	root	1111	service	service
admin	admin	root	jvbzd	root	1234	supervisor	supervisor
admin	admin1234	root	klv123	root	12345	support	support
admin	meinsm	root	klv1234	root	123456	tech	tech
admin	pass	root	pass	root	54321	ubnt	ubnt
admin	password	root	password	root	666666	user	user

Table 1: the factory default usernames & passwords used by Mirai
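Table 1 can also be used defensively. The sketch below is an offline audit that flags devices in an inventory whose configured login matches one of the listed defaults. The credential pairs are a subset copied from Table 1; the inventory, the device names and the function name are hypothetical, and a real audit would read them from whatever asset database is in use.

```python
# Subset of the default credentials listed in Table 1.
# "(none)" in the table is modeled here as an empty password.
MIRAI_DEFAULTS = {
    ("admin", "admin"),
    ("admin", ""),
    ("root", "root"),
    ("root", "xc3511"),
    ("root", "vizxv"),
    ("ubnt", "ubnt"),
    ("support", "support"),
}

def find_default_credentials(inventory):
    """Return the names of devices whose (user, password) pair is a known default."""
    return [
        name
        for name, (user, password) in inventory.items()
        if (user, password) in MIRAI_DEFAULTS
    ]

# Hypothetical inventory: device name -> (username, password).
inventory = {
    "camera-01": ("root", "xc3511"),
    "gateway-02": ("admin", "S7#long-unique-passphrase"),
    "dvr-03": ("admin", "admin"),
}

flagged = find_default_credentials(inventory)
print(flagged)
```

No network access is involved: the check only compares records that the operator already holds.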

Defensive approaches

Mitigating the risks posed by Mirai is somewhat difficult in that it uses legitimate credentials to gain access to devices which are usually not scanned for malicious activity. Nevertheless, there are some important defensive actions that can significantly reduce the risk of infection:

Change device credentials: after the deployment of the device, the user must change the credentials for SSH or Telnet access. It is important to note that the credentials that can be changed via the device GUI are not always the ones used for SSH or Telnet connections.
Close unused ports: as described before, ports 22 and 23 are the ones exploited by Mirai to gain access. Hence, they should never be publicly accessible. Moreover, they should be blocked at router level to prevent any access to internal devices.
Monitor ioctl: Mirai (and its derivatives) share the pattern of sending an ioctl request to the watchdog timer. This is done to prevent the watchdog from restarting the device unexpectedly. Given that the watchdog is a fundamental fail-safe system for Linux IoT devices (Weingel, 2007), it should never be disabled. Hence, monitoring its activity is a very important step in identifying the presence of Mirai.
White-hat script: an automated white-hat penetration script can be used to identify vulnerable devices in the network. Once they are found, the script should attempt to log in, in a manner similar to Mirai. This method could help identify exposed devices in very large networks, where the risk of human error is greater.
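As a minimal sketch of the "close unused ports" check, the function below reports which of the SSH and Telnet ports accept TCP connections on a host. It only opens and closes a connection and does not attempt any login. The function name is an assumption of this example, and it should only be run against devices you own or are authorized to test.

```python
import socket

def exposed_ports(host, ports=(22, 23), timeout=1.0):
    """Return the subset of `ports` that accept a TCP connection on `host`."""
    found = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex returns 0 on success instead of raising an exception
            if sock.connect_ex((host, port)) == 0:
                found.append(port)
    return found

# Example call (hypothetical device address, left commented out on purpose):
# print(exposed_ports("192.0.2.10"))
```

An empty list means neither port answered within the timeout, which is the desired result for a device facing the public internet.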

Other security considerations for IoT

During my research, I also encountered many other challenges, other than malware, in securing an IoT project (Zhang, et al., 2014):

Identifying objects in the network: until now, most IoT applications have used DNS to identify (and name) objects in the network. However, this system is still vulnerable to cache poisoning and man-in-the-middle attacks, which can inject fake DNS records into the target cache.
Authentication & authorization: while many public-key cryptosystems (such as the ones studied during this course) provide a theoretically sound system for authentication & authorization, the absence of a global root Certificate Authority prevents those cryptosystems from being effective. In addition, it may prove impractical to provide a certificate to each IoT object, given the sheer number of devices.
Cryptosystems & security protocols: public-key cryptosystems are highly desirable because they generally provide advanced security features. However, they are very often inoperable on IoT systems, given the resource constraints of these objects.

In addition, Zarca, et al. (2018) point out that Software Defined Networks (SDN) could help in devising new defensive approaches. According to the authors, SDNs have a number of defensive benefits:

Dynamic Flow Control: by decoupling the data plane from the control plane, it becomes possible to enable dynamic access control functions depending on specified privileges and policies.
Traffic Isolation: compromised sections of the network can be flexibly isolated.
Network-wide visibility and monitoring: given that SDN traffic is managed by a centralized controller, it becomes possible to monitor the traffic peaks generated by compromised devices in the network.

Works Cited

The Economist, 2016. The internet of stings. [Online]
Available at: https://www.economist.com/science-and-technology/2016/10/08/the-internet-of-stings
[Accessed 3 February 2021].

C. Douligeris, A. M., 2004. DDoS attacks and defense mechanisms: classification and state-of-the-art. Computer Networks, 5 April, pp. 643-666.

Gamblin, J., 2017. GitHub Mirai-Source-Code. [Online]
Available at: https://github.com/jgamblin/Mirai-Source-Code
[Accessed 06 February 2021].

Weingel, C., 2007. The Linux Watchdog driver API. [Online]
Available at: https://www.kernel.org/doc/html/latest/watchdog/watchdog-api.html
[Accessed 6 February 2021].

Margolis, J. et al., 2017. An In-Depth Analysis of the Mirai Botnet. Altoona, PA, International Conference on Software Security and Assurance (ICSSA).

Zhang, Z., Cho, M. C. Y., Wang, C. & Hsu, C., 2014. IoT Security: Ongoing Challenges and Research Opportunities. Matsue, Japan, IEEE 7th International Conference on Service-Oriented Computing and Applications.

Zarca, A. M., Bernabe, J. B., Farris, I. & Khettab, Y., 2018. Enhancing IoT security through network softwarization and virtual security appliances. International Journal of Network Management, 28(5).

On data science and IoT

For several days I have been reflecting on the deep connection between data science (or what we refer to as predictive modeling) and IoT. IoT is commonly defined as (Rouse, 2019):

a network of interconnected computing devices, mechanical actuators and sensors able to exchange data between themselves without the need of human interaction.

It’s clear to me that the connection between this new network of things and data science is striking. In fact, I strongly believe the real revolution will come when these two branches of technology are finally recognized as deeply related. I imagine a future where the data collected from sensors is transformed into insight and information by machine learning algorithms and automatically triggers a response in the physical world through actuators that are always connected to the Internet.

Until now, data science has mostly focused on data generated by social networks or the Internet (e.g. pictures, text mining on Twitter, etc.); the insights that can be gathered from this kind of data are limited in scope because no physical reaction can be triggered; or rather, no improvement in efficiency can be triggered by using such data. By contrast, the data generated by the IoT world will pertain to the physical realm: think for example of the footfall in a city or the amount and type of nutrients required by a crop field. All this data will be transmitted automatically and instantaneously over the Internet to algorithms able to predict and decide what to do based on it. This in turn will trigger a mechanical or chemical action, inducing a response that is predetermined by humans using Machine Learning.

It is clear to me that the connection of the two technologies will be very important for humanity at large and it will be a multiplier of human capabilities in almost all fields of the physical realm.

Simultaneous requests with Python

Today I had to test an API by sending multiple requests with Python. Of course, the requests must not be sequential but simultaneous; in other words, a simple for loop is not enough.

So I found a solution that spreads the requests across a pool of concurrent workers. Using a system such as Amazon Web Services EC2 or SageMaker obviously makes a difference, because it lets you increase the number of cores available.

Here is an example of the solution I chose:

import requests
from concurrent.futures import ThreadPoolExecutor

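def get_url(url):
    # Send a single GET request and return the Response object
    return requests.get(url)

# Ten identical URLs, so that ten requests are in flight at once
list_of_urls = ["https://postman-echo.com/get?foo1=bar1&foo2=bar2"]*10

# One worker thread per URL; pool.map returns the responses in input order
with ThreadPoolExecutor(max_workers=10) as pool:
    response_list = list(pool.map(get_url,list_of_urls))

for response in response_list:
    print(response)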
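To see that the pool really overlaps the waiting time, here is a small self-contained variant that needs no network access. The function fake_request is a stand-in invented for this sketch: it sleeps for 0.2 seconds to imitate the latency of requests.get.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request(i):
    """Stand-in for requests.get: sleeps to simulate network latency."""
    time.sleep(0.2)
    return i * 2

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    # map preserves the order of the inputs, whatever order the calls finish in
    results = list(pool.map(fake_request, range(5)))
elapsed = time.perf_counter() - start

print(results, round(elapsed, 2))
```

Five sequential calls would take about one second; with five worker threads the total should stay close to the duration of a single call, because the threads spend their time waiting rather than computing.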

Hybrid selections on a multi-index pandas DataFrame

Selecting views from a dataframe is bread and butter for a Data Scientist. The other day at work, however, I ran into a thorny situation:

selecting data from a multi-index dataframe using both a boolean mask and a specific index value.

The solution is the pandas.DataFrame.query() method.

Let's look at a practical example. Our input is:

import pandas as pd
index = pd.MultiIndex.from_product([['xy1','xy2','xy3'], ['1','2','3','4','5']], names=['Tag', 'Page'])
df = pd.DataFrame([1,1,1,4,5,1,1,61,4,51,1,1,4,5,1], index, columns=['Value'])

while what we want to obtain is this:

index2 = pd.MultiIndex.from_product([['1','2','3']], names=['Page'])
df2 = pd.DataFrame([1,1,1], index2, columns=['Value'])
df2

Using pandas.DataFrame.query() we can solve the dilemma in a very, very Pythonic way:

df1 = df.query("Tag == 'xy1' & Value == 1").reset_index(level=0, drop=True)

(the .reset_index() call is used to drop the Tag index level and obtain the desired cosmetic result)
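For comparison, the same selection can be written without query(), by combining a boolean mask on the values with a mask built from the index level. This is only an alternative sketch, not the approach chosen in this post; the variable names mask and selected are introduced here for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['xy1', 'xy2', 'xy3'], ['1', '2', '3', '4', '5']], names=['Tag', 'Page'])
df = pd.DataFrame([1, 1, 1, 4, 5, 1, 1, 61, 4, 51, 1, 1, 4, 5, 1], index, columns=['Value'])

# Boolean mask on the column combined with a mask on the 'Tag' index level
mask = (df['Value'] == 1) & (df.index.get_level_values('Tag') == 'xy1')

# Drop the 'Tag' level so that only 'Page' remains in the index
selected = df[mask].droplevel('Tag')
print(selected)
```

The query() version is shorter because it can refer to index levels and columns by name in the same expression.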

Sales forecasting: time series and ML

Predicting sales is the wildest dream of every self-respecting manager. Indeed, reliable estimates of market trends are the key to budgeting: the more accurate the estimates, the easier and more effective planning becomes.

As in every prediction problem, the variable that most affects the quality of the model is the information available. In my work experience, almost all companies have (for now) a very poor data culture, if any. The data scientist will therefore have to work with partial data, gathered from multiple sources, whose reliability will be simply doubtful. Obviously, this obstacle is specific to each case and will vary from scenario to scenario: it is therefore beyond the scope of this post.

What I really want to talk about is how to approach the algorithmic side of things. So, here is a crackling little program for tackling time series forecasting like real pros.

The basics

The problem can be framed in terms of two interdependent variables:

The prediction window
The lag of the variables

The prediction window represents how far in advance we want to predict Y (e.g. if the prediction is made 3 months ahead, the window is 3 months). The lag of the variables instead represents the time delay with which the independent variables influence the dependent one. For example: the sales of a metalworking company will be influenced by the trend of global GDP. But this relationship will not be immediate in time: the increase in demand for metal will follow the increase in GDP by x months.

While the prediction window is decided by the Data Scientist and management, the lag of the variables is fundamentally unknown. To find it in an optimal way, we resort to Cross Validation. The approach I propose is univariate.

Stay with me, and I will walk you through the functions for this purpose.
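Before the functions, a tiny illustration of what "lagging" a variable means in pandas. The numbers and the column names gdp and sales are invented for this sketch, and the lag of 2 is simply assumed rather than estimated.

```python
import pandas as pd

# Hypothetical monthly series: a GDP index and the sales we want to explain.
data = pd.DataFrame({
    "gdp":   [100, 101, 103, 106, 110, 115],
    "sales": [50.0, 50.0, 50.0, 50.5, 51.5, 53.0],
})

lag = 2
# shift(lag) moves the column down, so each row sees the GDP of two months earlier
data[f"gdp_lag{lag}"] = data["gdp"].shift(lag)

# The first `lag` rows have no lagged value and are dropped before modeling
aligned = data.dropna()
print(aligned)
```

Each extra month of lag costs one row of training data, which is one reason to cap the maximum lag that gets tested.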

The algorithmic oracle

The program consists of three functions:

laggedDataMat: lags the dataset (i.e. "shifts" the variables by a given lag)
bestLagRegr: finds the optimal (univariate) lag for each variable
modelSelection: takes the output of bestLagRegr and runs a grid search to find the optimal predictive model

The libraries required to run it are:

import numpy as np, pandas as pd, pickle, copy
from sklearn.model_selection import cross_val_score, TimeSeriesSplit, GridSearchCV
from sklearn import ensemble
from sklearn.ensemble import GradientBoostingRegressor

Here are the functions in Python:

bestLagRegr

def bestLagRegr(dataMat, maxLag, minLag, yName):

    # Purpose: identify the best lag for each variable to be used in regression, using cross-validation
    # (folds from TimeSeriesSplit)

    # dataMat - a pandas DF with named columns
    # maxLag - the maximum lag to be tested (exclusive upper bound of the range). Integer
    # minLag - the minimum lag to be tested. Integer
    # yName - the name of the column of 'dataMat' containing the dependent var. String

    data = dataMat.copy()
    colnames = [y for y in data.columns if y != yName]
    lags = range(minLag, maxLag)
    folds = TimeSeriesSplit(n_splits=int(np.round(data.shape[0] / 10, 0)))

    results = {}

    for col in colnames:

        scores = []
        lags_list = []

        for l in lags:
            varname = col + str('_lag') + str(l)
            data[varname] = data[col].shift(l)
            YX = data[[yName, varname]]
            YX = YX.dropna().to_numpy()  # as_matrix() was removed from pandas

            # Build regressor and estimate metric of prediction performance with CV
            regr = ensemble.GradientBoostingRegressor(learning_rate=0.01, max_depth=1, n_estimators=500)
            perform = cross_val_score(regr, X=YX[:, 1].reshape(-1, 1), y=YX[:, 0], cv=folds,
                                      scoring='neg_median_absolute_error')

            # Store scores result and lags
            scores.append(np.median(perform))
            lags_list.append(l)

        # Calculate best score, corresponding best lag and store it in a dictionary, containing all colnames
        best_score = max(scores)
        best_lag = lags_list[scores.index(best_score)]
        results[col] = [best_lag, best_score]

    return(results)

laggedDataMat

def laggedDataMat(dataMat, yName, lagsDict):

    # Purpose: build a lagged DF
    # dataMat: the unlagged DF with named columns
    # yName: name of the dependent var. String
    # lagsDict: dictionary produced with 'bestLagRegr', containing:
    #             - keys: column names of dataMat
    #             - values: lists with [lag, CV score]
    #
    # Output: a pandas DataFrame with columns sorted alphabetically

    # Initialize empty DF
    df = pd.DataFrame(index=dataMat.index)
    # Set dependent var
    df[yName] = dataMat[yName]

    # Creating and adding the lagged vars
    for colName in lagsDict.keys():
        l = lagsDict[colName][0]
        colNameLag = colName + str('_lag') + str(l)
        df[colNameLag] = dataMat[colName].shift(l)

    df = df.sort_index(axis=1)

    return(df)

modelSelection

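def modelSelection(maxLag, data, depVar, toSave, pSpace = None, alpha = 0.95):

    # Purpose: for each minimum lag from 1 to maxLag - 1, find the best lags, grid-search a
    # GradientBoostingRegressor and fit two quantile models as upper and lower prediction bounds
    # toSave - path of the folder where the fitted models are pickled. String
    # pSpace - parameter grid for GridSearchCV; a default grid is used when None
    # alpha - quantile of the upper bound; the lower bound uses 1 - alpha

    if pSpace is None: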
        pSpace = dict(n_estimators=list(range(5, 2000, 10)),
                      learning_rate=list(np.arange(0.001, 1, 0.1)),
                      max_depth=list(range(1, 3, 1)))

    lags = range(1, maxLag)
    results = dict()

    for lagMin in lags:
        print('Estimating model for lag: ', lagMin)
        lagAnalysis = bestLagRegr(data, maxLag, lagMin, depVar)
        lagMat = laggedDataMat(data, depVar, lagAnalysis)
        lagMat = lagMat.dropna()
        trainY = np.ravel(lagMat[depVar])
        lagMat = lagMat.drop(columns=[depVar])

        folds = TimeSeriesSplit(n_splits=int(round(lagMat.shape[0] / 10, 0)))

        # 'squared_error' is the current name of the least-squares loss (formerly 'ls')
        model = GradientBoostingRegressor(loss='squared_error')
        regr = GridSearchCV(estimator=model, param_grid=pSpace, scoring='neg_mean_squared_error', cv=folds)
        regr.fit(lagMat, trainY)

        modelName = toSave + '/' + 'modelOI' + '_lag' + str(lagMin) + '.sav'
        with open(modelName, 'wb') as modelFile:
            pickle.dump(regr.best_estimator_, modelFile)
        temp = dict(BestModel=regr.best_estimator_, Score=regr.best_score_, Lags=lagAnalysis)

        regrQUpper = copy.deepcopy(regr.best_estimator_)
        regrQUpper.set_params(loss='quantile', alpha=alpha)
        regrQUpper.fit(lagMat, trainY)
        temp['QUpper'] = regrQUpper

        regrQLower = copy.deepcopy(regr.best_estimator_)
        regrQLower.set_params(loss='quantile', alpha=(1-alpha))
        regrQLower.fit(lagMat, trainY)
        temp['QLower'] = regrQLower

        key = 'Lag ' + str(lagMin)
        results[key] = temp

    return(results)

Computing minimum stock levels with R

Estimating the minimum stock of raw materials is a difficult and error-prone procedure, especially when done with heuristic methods.

Luckily, statistics saves us once again, ensuring an optimal stock level and maximizing inventory savings.

The problem

What does the problem look like? We have 3 variables to consider:

Reorder times (in this example, 10 days of lead time)
Historical consumption
The Lean Manager chasing you in a panic

For the example in this post, the product to be stocked is sheet metal (measured in kilograms), with a reorder time of 10 days.

Exploratory Data Analysis

*Fig. 1* Sheet metal consumption plotted against the time variable. The seasonality is pronounced.

A simple plot (Fig. 1) of sheet metal consumption against ten-day periods shows a fairly pronounced seasonality: it would therefore be sensible to optimize along the time variable, so as to save on inventory.

*Fig. 2* Histogram of sheet metal consumption. The approximation to a Gamma distribution is clearly visible.

If we draw a histogram of sheet metal consumption, the distribution can moreover clearly be approximated by a Gamma.
We are lucky: we can already venture a model.

The model

How do we build an estimate for the minimum stock? A problem like this is easily solved with the concept of a confidence interval (CI).

We can then compute the mean consumption of sheet metal given the time variable; that is, the solution to the problem is to build a confidence interval (CI) around:

$E[Y|(X,\omega)] = \mu_{Y|(X, \omega)}$

Where:

Y is the metal consumption
X is the matrix of independent variables (i.e. the index of the 10-day period)
ω is the product code (in this case only one type of sheet metal)
μ denotes the mean

To estimate this mean, we need to fit a GLM with Gamma family and logarithmic link function:

$\hat{y}_{i,k} = e^{\beta_{0} + \beta_{1}x_{i} + \beta_{2}\omega_{k}}$

I remind the dear reader that on the scale of the link function, reality is Normal. Given that, building a CI at level α is trivial: we only need the classic calculation with the t distribution (because the population standard deviation is unknown, you rascal!).

The GLM (and the CI built this way) estimates the mean consumption per period: the upper limit of that CI is therefore the best choice for the minimum stock.
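As a sketch of that calculation, written in notation that is assumed here rather than taken from the original ($\hat{\eta}_{i,k}$ for the fitted value on the link scale, $\widehat{SE}$ for its standard error, $n$ for the number of observations), the upper limit is computed on the log scale and then mapped back through the inverse link:

$\hat{\eta}_{i,k} = \beta_{0} + \beta_{1}x_{i} + \beta_{2}\omega_{k}$

$\text{MinimumStock}_{i,k} = \exp\left(\hat{\eta}_{i,k} + t_{\alpha,\,n-1}\,\widehat{SE}(\hat{\eta}_{i,k})\right)$

Because the exponential is monotone, the upper limit on the link scale maps to the upper limit on the consumption scale. This should correspond to what the R code at the end of the post does when it computes upr and passes it to linkinv.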

Obviously, the Lean Manager's decision now comes down to the choice of the parameter α.

And here is the result:

Minimum sheet metal stock: the suggested minimum level is the red line (95% confidence), while the green points are the mean consumption per period.

For a more mathematically rigorous treatment of the problem, you can refer to my GitHub, where you will find the project documentation. The code used is below.

set.seed(123)


# Prototyping for Minimum stock of metal sheets

lamiera <- read.csv2("foo/Lamiera.csv", sep= ";", dec = ",")
lamiera <- lamiera[lamiera$Year >= '2014',] # Only considering years 2014 onwards. By EDA boxplot(quantity~Year), 2014 is the first year of greater variability

# Fitting model: glm, Gamma distr with log-link function
lmTest <- glm(Quantity~as.factor(TenDaysOfYear)+as.factor(PartNo), data = lamiera, family= Gamma(link= 'log'))


# Estimating CI and plotting

alpha <- 0.98 # Confidence level: quantile passed to qt() for the upper bound

codiciLamiera <- unique(lamiera$PartNo)
for(codLamiera in codiciLamiera){
  
  
  preddata <- data.frame('TenDaysOfYear' = seq(1,36, by=1), 'PartNo' = codLamiera)
  preds <- predict(lmTest, newdata = preddata, type = "link", se.fit = TRUE)
  
  critval <- qt(alpha, nrow(lamiera)-1)
  upr <- preds$fit + (critval * preds$se.fit)
  lwr <- preds$fit - (critval * preds$se.fit)
  fit <- preds$fit
  fit2 <- lmTest$family$linkinv(fit)
  upr2 <- lmTest$family$linkinv(upr)
  lwr2 <- lmTest$family$linkinv(lwr)
  
  name <- paste('foo/','lamiera_', codLamiera,'.png', sep = '')
  
  png(filename = name)
  plot(Quantity~TenDaysOfYear, data= lamiera[lamiera$PartNo == codLamiera,], main=codLamiera)
  lines(upr2, col="red")
  points(fit2, col="green")
  dev.off()
  
  name <- paste('foo/','lamiera_', codLamiera,'.csv', sep = '')
  preddata$PredictedConsumption <- round(fit2, 2)
  preddata$MinimumStock <- round(upr2, 2)
  write.csv2(preddata, file = name, row.names = FALSE)
}



# Model checks: all assumptions are well respected
plot(rstandard(lmTest), ylab = "Standardized residuals") # Super-nice looking plot: homoschedasticity respected
resOrdered <- rstandard(lmTest)[order(lamiera$TenDaysOfYear)]
plot(1:nrow(lamiera), resOrdered) # Independence respected
hist(rstandard(lmTest), main="Distribution of standardized residuals", xlab="Standardized residuals")
qqplot(rnorm(length(rstandard(lmTest))) ,rstandard(lmTest), xlab = "Std Normal sample", ylab = "Standardized residuals")
abline(c(0,1), col= "red")# Normality respected
plot(fitted(lmTest), residuals(lmTest, type = "pearson"), xlab = 'Fitted values', ylab = 'Pearson residuals') # mild pattern of pearson residuals, but can be overlooked due to big sample size
# Linearity: respected by construction (i.e. scale of link is always linear)

Business Analytics with retaileR (part II)

Here is the second part of the article on the new R package retaileR. You can find the first part here.

The Functions

closure.opp.cost

closure.opp.cost estimates the opportunity cost as a function of the closing time of retail premises. That is, an estimate of the monetary value lost depending on the closing time.

Its arguments are:

Using the preformatted dataset, here is an example:

closure_opp_cost("16:15:00", X= sales_august)

and the output:

$Mean
[1] 10.35812

$Median
[1] 10.35

$StDev
[1] 8.392746

$`Late Sales`
[1] 10.30 18.90 6.30 26.70 14.60 10.60 0.05 24.18 10.40 9.60 14.10 14.90 5.10 0.00 0.00 0.00

In this case, closing the shop at 16:15:00 would mean an average loss of 10.36 €, with a standard deviation of 8.39 €. Moreover, the Late Sales vector clearly shows that on 3 days there were no sales after 16:15:00.
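To make the idea concrete, here is a rough Python sketch of the same kind of calculation. It is not the retaileR implementation: the function below, its signature and the sample records are assumptions made for illustration, with each sale given as a (timestamp, amount) pair.

```python
from datetime import datetime, time
from statistics import mean, median, stdev

def closure_opp_cost(closing, sales):
    """Summarise the revenue recorded after `closing` on each day.

    Every day present in the data contributes one value,
    which is 0.0 when nothing was sold after the closing time.
    """
    late = {}
    for stamp, amount in sales:
        late.setdefault(stamp.date(), 0.0)
        if stamp.time() > closing:
            late[stamp.date()] += amount
    values = [late[day] for day in sorted(late)]
    return {"Mean": mean(values), "Median": median(values),
            "StDev": stdev(values), "Late Sales": values}

# Hypothetical sales records, not taken from the sales_august dataset.
sales = [
    (datetime(2017, 8, 1, 15, 50), 4.0),
    (datetime(2017, 8, 1, 16, 30), 10.0),
    (datetime(2017, 8, 1, 17, 5), 2.0),
    (datetime(2017, 8, 2, 12, 0), 7.5),
    (datetime(2017, 8, 3, 16, 45), 6.0),
]

summary = closure_opp_cost(time(16, 15), sales)
print(summary)
```

Days without late sales are kept as zeros, which matches the zeros visible at the end of the Late Sales vector above.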

segment_prod_line

This function was created to analyze the different product families over a given time segment.

That is, if the time variable of the input dataset is daily, segment_prod_line will return the average daily sales for the product line of interest.

Let's see how it works:

we specify the product line as a vector of strings:

prods &lt;- c("Americano", "Espresso")

e successivamente evochiamo segment_prod_line

segment_prod_line(sales_august, prods)

here is the output:

[1] 31.56065

this means that, over the days covered by the dataset, we sold on average about 31.6 Americanos and Espressos per day.
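To make the idea of "average daily sales for a product line" concrete, here is a minimal sketch in base R. It is not the package's implementation: avg_daily_units and toy_sales are hypothetical names, and the toy data only borrows the Date, Name and Quantity column names from the iZettle export.

```r
# Toy data using the Date / Name / Quantity columns of the iZettle export
toy_sales <- data.frame(
  Date     = as.Date(c("2017-08-01", "2017-08-01", "2017-08-01",
                       "2017-08-02", "2017-08-02")),
  Name     = c("Americano", "Espresso", "Mocha", "Americano", "Espresso"),
  Quantity = c(2, 1, 4, 3, 2)
)

# Hypothetical helper: mean number of units sold per day for a product line
avg_daily_units <- function(X, products) {
  line    <- X[X$Name %in% products, ]                          # keep the product line only
  per_day <- aggregate(Quantity ~ Date, data = line, FUN = sum) # units per day
  mean(per_day$Quantity)                                        # average across days
}

avg_daily_units(toy_sales, c("Americano", "Espresso"))  # 3 units on day 1, 5 on day 2
```

In this sketch, days on which no product of the line was sold are dropped rather than counted as zero. Whether segment_prod_line behaves the same way is not stated in the article.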

mean.items.times

Lastly, a function for analysing sales over time. mean.items.times computes the average sales (by volume) for each hour; that is, it creates a histogram with the time unit on the x-axis and average sales on the y-axis.

Here is how it works (as a reminder, it is optimised for an iZettle dataset):

mean.items.times(sales_august)

and the corresponding output:

       Time  Quantity
1  05:00:00  1.000000
2  06:00:00  9.000000
3  07:00:00 23.100000
4  08:00:00 30.695652
5  09:00:00 24.826087
6  10:00:00 17.695652
7  11:00:00 16.000000
8  12:00:00 11.173913
9  13:00:00 10.608696
10 14:00:00  8.590909
11 15:00:00  7.714286
12 16:00:00  6.000000
13 17:00:00  2.000000
14 18:00:00  1.000000
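A table of this kind can be obtained by summing the quantities per day and hour, then averaging across days. The sketch below does this in base R under stated assumptions: mean_units_per_hour and toy_sales are hypothetical names, and Time is assumed to already be an 'hour:minute' string.

```r
# Toy data: Time is assumed to be formatted as "%H:%M"
toy_sales <- data.frame(
  Date     = as.Date(c("2017-08-01", "2017-08-01", "2017-08-01",
                       "2017-08-02", "2017-08-02")),
  Time     = c("06:19", "06:52", "07:10", "06:05", "07:45"),
  Quantity = c(1, 2, 4, 5, 2)
)

# Hypothetical helper: average units sold in each hour of the day
mean_units_per_hour <- function(X) {
  X$Hour <- substr(X$Time, 1, 2)                                        # "06:19" -> "06"
  per_day_hour <- aggregate(Quantity ~ Date + Hour, data = X, FUN = sum)
  out <- aggregate(Quantity ~ Hour, data = per_day_hour, FUN = mean)    # mean across days
  out$Time <- paste0(out$Hour, ":00:00")
  out[, c("Time", "Quantity")]
}

hourly <- mean_units_per_hour(toy_sales)
print(hourly)
```

The result can then be drawn with barplot(hourly$Quantity, names.arg = hourly$Time) to get the chart described above.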

‘A Contribution to Statistics’ by Wislawa Szymborska

No comment: simply a poem I came across recently. About statistics, of course. Enjoy the read.

A Contribution to Statistics by Wislawa Szymborska

Out of a hundred people:

those who always know better than the others
– fifty-two;

unsure at every step
– nearly all the rest;

ready to help,
as long as it does not take long
– as many as forty-nine;

always good,
because they cannot do otherwise
– four, well, maybe five;

inclined to admire without envy
– eighteen;

living in constant fear
of someone or something
– seventy-seven;

gifted for happiness
– at most a little over twenty;

harmless on their own,
turning savage in a crowd
– more than half for sure;

cruel,
when forced by circumstances
– better not to know,
not even approximately;

those wise after the fact
– not many more
than those wise before it;

those who take only things from life
– forty,
though I wish I were wrong;

hunched, in pain
and with no torch in the dark
– eighty-three,
sooner or later;

worthy of compassion
– ninety-nine;

mortal
– a hundred out of a hundred.
A figure unchanged so far.

Business Analytics with retaileR (Part I)

Apologies for the long absence: I have been very busy over the last few months. To make up for it, though, I have developed a little gem just for you:

a brand-new R package.

The idea of creating an R package has always fascinated me. After all, it is thanks to open-source contributions that this language keeps growing richer. However, I always lacked an area of application: I could never think of what to build the package for. Luckily, over the last few months I took part in a project that introduced me to the world of retail (business-to-consumer commerce, that is) and its peculiar problems; hence the idea of building a set of functions to support data analysis in this sector.

The package, developed in R, is called retaileR and you can find it on my GitHub. Over the coming months I will try to make it available on CRAN, the official repository for R packages.

What can retaileR do? retaileR was born from the idea of creating a group of functions useful for analysing sales data, so that decision-making becomes simpler and is backed by more effective estimates. retaileR can help decide a branch's closing time, organise staffing by identifying the hours of maximum and minimum volume, and support many other similar tasks.

In this series of articles I will illustrate the functions and provide examples of how to use this new package.

The Data

The baseline dataset is assumed to be the one generated automatically by iZettle, a POS system for retail. Below are the first rows of an illustrative dataset (which I will use in the following examples), also available on GitHub.

Let us load the dataset of fictitious sales for one month and display its first rows.

sales_august <- read.csv("foo/sales_august.csv")
head(sales_august)

Here is the output:

        Date                Time Receipt.number       Name    Variant Unit Quantity Price..GBP. Discount..GBP. Final.price..GBP.
1 2017-08-01 2017-08-20 06:19:00           7359  Americano              NA        1         2.5           0.00              2.50
2 2017-08-01 2017-08-20 06:19:00           7359   Espresso              NA        1         2.0           0.00              2.00
3 2017-08-01 2017-08-20 06:28:00           7360      Mocha              NA        1         3.1           1.55              1.55
4 2017-08-01 2017-08-20 06:43:00           7361 Flat White              NA        2         5.4           0.00              5.40
5 2017-08-01 2017-08-20 06:52:00           7362  Macchiato              NA        1         2.2           0.00              2.20
6 2017-08-01 2017-08-20 06:52:00           7362 Extra Shot Extra shot   NA        1         0.2           0.00              0.20

The Functions

sales.format

This function is meant to be the starting point of every analysis with retaileR, since its job is to format the data to be analysed correctly. In particular, it acts on the time variables (i.e. date and time of day), formatting them uniformly.

Here is an example:

sales_august <- sales.format(sales_august)

Its arguments are: X, date_var = "Date", time_var = "Time", format_time = '%H:%M', where:

X: the data frame containing the data to be analysed.
date_var: the name, as a string, of the variable containing the date. Defaults to "Date".
time_var: same as date_var, but for the time of day. Defaults to "Time".
format_time: how the time should be formatted. retaileR requires the 'hour:minute' format, which is what the default '%H:%M' specifies.
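As an illustration of what the '%H:%M' format means, the sketch below converts timestamps like those in the raw Time column into 'hour:minute' strings with base R. It only shows the format codes at work and does not claim to reproduce the internals of sales.format; the names raw_time, parsed and formatted are illustrative.

```r
# Timestamps in the style of the raw Time column of the iZettle export
raw_time <- c("2017-08-20 06:19:00", "2017-08-20 06:28:00")

# Parse into date-time objects, fixing the time zone to keep the example reproducible
parsed <- as.POSIXct(raw_time, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Keep only hour and minute, as required by the format_time default
formatted <- format(parsed, "%H:%M")
print(formatted)
```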

as.sales

Analysing sales data poses particular problems with the time variable. Since sales form a time series, they can be grouped at different time scales: hours, days, weeks, etc.

Moreover, it is not easy to tell whether the growth trend is positive or negative: seasonality and random fluctuations very often make this difficult.

To address these two problems, the function as.sales creates a new class of objects (in S4) that includes a simple linear regression and several time groupings.

Its arguments are:

X: the data frame containing the data to be analysed.
date_var: the name, as a string, of the variable containing the date. Defaults to "Date".
time_var: same as date_var, but for the time of day. Defaults to "Time".
sales_var: the name, as a string, of the variable containing the monetary value of each individual sale. Defaults to "Final.price..GBP.".
format_time: how the time should be formatted. retaileR requires the 'hour:minute' format, which is what the default '%H:%M' specifies.
discount_var: the name of the column containing the discounts. Defaults to "Discount..GBP.".

As output, the function produces an S4 object of class sales, whose structure is the following:


here is the output:

 with 5 slots
  ..@ Daily           :'data.frame':	16 obs. of  2 variables:
  .. ..$ Date             : Date[1:16], format: "2017-08-01" "2017-08-02" "2017-08-03" "2017-08-04" ...
  .. ..$ Final.price..GBP.: num [1:16] 287 268 280 342 337 ...
  ..@ Weekly          :'data.frame':	3 obs. of  2 variables:
  .. ..$ Week             : num [1:3] 31 32 33
  .. ..$ Final.price..GBP.: num [1:3] 1514 2757 1565
  ..@ Monthly         :'data.frame':	1 obs. of  2 variables:
  .. ..$ Month            : num 8
  .. ..$ Final.price..GBP.: num 5836
  ..@ Discount        : num 39.8
  ..@ sales_funct_time:List of 12
  .. ..$ coefficients : Named num [1:2] -201711.1 11.6
  .. .. ..- attr(*, "names")= chr [1:2] "(Intercept)" "Date"
  .. ..$ residuals    : Named num [1:16] 9.79 -21.28 -20.64 29.43 13.04 ...
  .. .. ..- attr(*, "names")= chr [1:16] "1" "2" "3" "4" ...
  .. ..$ effects      : Named num [1:16] -1459 214.3 -18.6 30.4 13.1 ...
  .. .. ..- attr(*, "names")= chr [1:16] "(Intercept)" "Date" "" "" ...
  .. ..$ rank         : int 2
  .. ..$ fitted.values: Named num [1:16] 278 289 301 312 324 ...
  .. .. ..- attr(*, "names")= chr [1:16] "1" "2" "3" "4" ...
  .. ..$ assign       : int [1:2] 0 1
  .. ..$ qr           :List of 5
  .. .. ..$ qr   : num [1:16, 1:2] -4 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 ...
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:16] "1" "2" "3" "4" ...
  .. .. .. .. ..$ : chr [1:2] "(Intercept)" "Date"
  .. .. .. ..- attr(*, "assign")= int [1:2] 0 1
  .. .. ..$ qraux: num [1:2] 1.25 1.27
  .. .. ..$ pivot: int [1:2] 1 2
  .. .. ..$ tol  : num 1e-07
  .. .. ..$ rank : int 2
  .. .. ..- attr(*, "class")= chr "qr"
  .. ..$ df.residual  : int 14
  .. ..$ xlevels      : Named list()
  .. ..$ call         : language lm(formula = as.formula(paste(sales_var, "~", date_var)), data = day)
  .. ..$ terms        :Classes 'terms', 'formula'  language Final.price..GBP. ~ Date
  .. .. .. ..- attr(*, "variables")= language list(Final.price..GBP., Date)
  .. .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. .. ..$ : chr [1:2] "Final.price..GBP." "Date"
  .. .. .. .. .. ..$ : chr "Date"
  .. .. .. ..- attr(*, "term.labels")= chr "Date"
  .. .. .. ..- attr(*, "order")= int 1
  .. .. .. ..- attr(*, "intercept")= int 1
  .. .. .. ..- attr(*, "response")= int 1
  .. .. .. ..- attr(*, ".Environment")=<environment: 0x102165000> 
  .. .. .. ..- attr(*, "predvars")= language list(Final.price..GBP., Date)
  .. .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "other"
  .. .. .. .. ..- attr(*, "names")= chr [1:2] "Final.price..GBP." "Date"
  .. ..$ model        :'data.frame':	16 obs. of  2 variables:
  .. .. ..$ Final.price..GBP.: num [1:16] 287 268 280 342 337 ...
  .. .. ..$ Date             : Date[1:16], format: "2017-08-01" "2017-08-02" "2017-08-03" "2017-08-04" ...
  .. .. ..- attr(*, "terms")=Classes 'terms', 'formula'  language Final.price..GBP. ~ Date
  .. .. .. .. ..- attr(*, "variables")= language list(Final.price..GBP., Date)
  .. .. .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. .. .. ..$ : chr [1:2] "Final.price..GBP." "Date"
  .. .. .. .. .. .. ..$ : chr "Date"
  .. .. .. .. ..- attr(*, "term.labels")= chr "Date"
  .. .. .. .. ..- attr(*, "order")= int 1
  .. .. .. .. ..- attr(*, "intercept")= int 1
  .. .. .. .. ..- attr(*, "response")= int 1
  .. .. .. .. ..- attr(*, ".Environment")=<environment: 0x102165000> 
  .. .. .. .. ..- attr(*, "predvars")= language list(Final.price..GBP., Date)
  .. .. .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "other"
  .. .. .. .. .. ..- attr(*, "names")= chr [1:2] "Final.price..GBP." "Date"
  .. ..- attr(*, "class")= chr "lm"

As can be seen, the object created by as.sales contains 5 sub-objects (slots):

3 time groupings: Daily, Weekly, Monthly.
A value for the total discounts in the dataset (Discount).
An lm linear regression object (sales_funct_time), useful for understanding the trend.
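For readers unfamiliar with S4, an object with these five slots could be built roughly as follows. This is a sketch, not the retaileR source: the class name sales_sketch, the toy data, the use of ISO week numbers ("%V") and the "ANY" slot type for the regression are all assumptions made for illustration. Only the slot names and the regression of daily sales on Date come from the output above.

```r
library(methods)

# Illustrative class mirroring the five slots shown above (slot types are assumptions)
setClass("sales_sketch", slots = c(
  Daily            = "data.frame",
  Weekly           = "data.frame",
  Monthly          = "data.frame",
  Discount         = "numeric",
  sales_funct_time = "ANY"          # holds the fitted lm object
))

# Toy data using the column names of the iZettle export
toy <- data.frame(
  Date              = as.Date("2017-08-01") + c(0, 0, 1, 2, 7, 8),
  Final.price..GBP. = c(2.5, 2.0, 5.0, 6.0, 7.5, 9.0),
  Discount..GBP.    = c(0, 0, 1.5, 0, 0, 0.5)
)
toy$Week  <- as.numeric(format(toy$Date, "%V"))   # ISO week number (assumption)
toy$Month <- as.numeric(format(toy$Date, "%m"))

# Total sales at each time scale
daily   <- aggregate(Final.price..GBP. ~ Date,  data = toy, FUN = sum)
weekly  <- aggregate(Final.price..GBP. ~ Week,  data = toy, FUN = sum)
monthly <- aggregate(Final.price..GBP. ~ Month, data = toy, FUN = sum)

obj <- new("sales_sketch",
  Daily            = daily,
  Weekly           = weekly,
  Monthly          = monthly,
  Discount         = sum(toy$Discount..GBP.),
  sales_funct_time = lm(Final.price..GBP. ~ Date, data = daily)  # trend of daily sales
)

slotNames(obj)
coef(obj@sales_funct_time)["Date"]   # a positive slope suggests growing daily sales
```

Slots are accessed with @, for example obj@Weekly, and the sign of the Date coefficient gives a first indication of whether daily sales are trending up or down.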