<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="http://vulgairedev.fr/feed.xml" rel="self" type="application/atom+xml" /><link href="http://vulgairedev.fr/" rel="alternate" type="text/html" /><updated>2025-10-29T15:35:20+00:00</updated><id>http://vulgairedev.fr/feed.xml</id><title type="html">VulgaireDev</title><subtitle>Tech blog of a Vulgaire Dev.
</subtitle><author><name>Romain Mathonat</name><email>romain.mathonat@gmail.com</email></author><entry><title type="html">Python dependency management and virtual environments in 2024</title><link href="http://vulgairedev.fr/2024/08/26/python-deps.html" rel="alternate" type="text/html" title="Python dependency management and virtual environments in 2024" /><published>2024-08-26T00:00:00+00:00</published><updated>2024-08-26T00:00:00+00:00</updated><id>http://vulgairedev.fr/2024/08/26/python-deps</id><content type="html" xml:base="http://vulgairedev.fr/2024/08/26/python-deps.html"><![CDATA[<p>If you have ever run into any of the following problems, this guide will help you (and may even save you):</p>

<ul>
  <li>I installed a new library with pip install &lt;ma_lib&gt; and now everything is broken</li>
  <li>I have a venv that I want to reproduce in production, how do I do that?</li>
  <li>I installed Python 3.10.2 and 3.11.2 but I don't understand how to use one or the other?</li>
  <li>What is this incomprehensible pile of tools: pip, easy_install, poetry, conda, virtualenv, pdm?</li>
  <li>Why do I sometimes see setup.py files, and sometimes pyproject.toml? Why pdm.lock or poetry.lock lockfiles?</li>
  <li>My venv works perfectly, with lots of nice data science libraries; I pip install a new lib and bam, everything is broken and I don't know why?</li>
  <li>I JUST WANT SOMETHING SIMPLE THAT LETS ME WORK IN NOTEBOOKS PRAGMATICALLY, HOW DO I DO IT?</li>
</ul>

<h2 id="pré-requis">Prerequisites:</h2>

<p>Have either a Linux machine on which you can install what you want, or WSL under Windows, so that you can work efficiently (the install is simple nowadays: <a href="https://learn.microsoft.com/en-us/windows/wsl/install">https://learn.microsoft.com/en-us/windows/wsl/install</a>), have Python installed (most commonly used Linux distributions ship one by default), as well as pip (likewise).</p>

<h2 id="jai-installé-une-nouvelle-lib-avec-pip-install-pandas-et-maintenant-tout-est-cassé">I installed a new library with pip install pandas and now everything is broken</h2>

<p>When you run:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>pandas
</code></pre></div></div>

<p>directly in the terminal without any other precaution, you install the library system-wide, which causes problems:</p>

<ul>
  <li>If another project needs a different version of pandas than the current one, the two cannot coexist</li>
  <li>You risk changing the pandas version used by the other project (and therefore breaking it)</li>
</ul>

<p>In both cases you can get error messages that are sometimes rather obscure, telling you there is a conflict.</p>

<p>To solve this problem, Python uses “<strong>virtual environments</strong>”, i.e. a mechanism that isolates Python versions and dependencies.</p>

<p>Generally, you will want one virtual environment (or venv) per project.</p>

<p>To create one:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python <span class="nt">-m</span> venv projet_a_env
</code></pre></div></div>

<p>Then activate it (i.e. “step into” it):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">source </span>projet_a_env/bin/activate
</code></pre></div></div>

<p>You can then simply install the libraries you want:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>pandas
</code></pre></div></div>

<p><img src="/assets/images/venv_python/image.png" alt="" style="display:block; margin-left:auto; margin-right:auto" /></p>

<p>If you want to leave the virtual environment:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>deactivate
</code></pre></div></div>

<p>You are then back in the global system, where pandas is not installed:</p>

<p><img src="/assets/images/venv_python/image%201.png" alt="" style="display:block; margin-left:auto; margin-right:auto" /></p>
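<p>You can also confirm from Python itself which context you are in. A minimal standard-library sketch to detect whether the running interpreter belongs to a virtual environment:</p>

```python
import sys

def in_virtualenv() -> bool:
    # Inside a venv, sys.prefix points to the venv directory,
    # while sys.base_prefix still points to the base interpreter.
    return sys.prefix != sys.base_prefix

print(in_virtualenv())
```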

<h2 id="jai-un-venv-que-je-veux-reproduire-en-prod-comment-faire-">I have a venv that I want to reproduce in production, how do I do that?</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip freeze <span class="o">&gt;</span> requirements.txt
</code></pre></div></div>

<p>The requirements.txt file then looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>numpy==1.24.4
pandas==2.0.3
python-dateutil==2.9.0.post0
pytz==2024.1
six==1.16.0
tzdata==2024.1
</code></pre></div></div>

<p>Then, once inside the new production venv:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install</span> <span class="nt">-r</span> requirements.txt
</code></pre></div></div>

<h2 id="jai-installé-python-3102-et-3112-mais-je-ne-comprend-pas-comment-utiliser-lun-ou-lautre-">I installed Python 3.10.2 and 3.11.2 but I don't understand how to use one or the other?</h2>

<p>You can quickly end up having to manage several Python versions on the same machine (depending on the projects you work on). The simplest way to handle this is to install and use pyenv:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl https://pyenv.run | bash
</code></pre></div></div>

<p>then:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">echo</span> <span class="s1">'export PYENV_ROOT="$HOME/.pyenv"'</span> <span class="o">&gt;&gt;</span> ~/.bashrc
<span class="nb">echo</span> <span class="s1">'command -v pyenv &gt;/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"'</span> <span class="o">&gt;&gt;</span> ~/.bashrc
<span class="nb">echo</span> <span class="s1">'eval "$(pyenv init -)"'</span> <span class="o">&gt;&gt;</span> ~/.bashrc
</code></pre></div></div>

<p>(if you use zsh or another shell, see the documentation: <a href="https://github.com/pyenv/pyenv">https://github.com/pyenv/pyenv</a>)</p>

<p>Next, install the dependencies required to compile other Python versions (here on Ubuntu/Debian):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt update
<span class="nb">sudo </span>apt <span class="nb">install</span> <span class="nt">-y</span> make build-essential libssl-dev zlib1g-dev <span class="se">\</span>
libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm <span class="se">\</span>
libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev
</code></pre></div></div>

<p>Then, installing a Python version on the system is simple:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pyenv <span class="nb">install </span>3.10.2
</code></pre></div></div>

<p>Then, to switch to a specific Python version for the current project:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pyenv <span class="nb">local </span>3.10.2
</code></pre></div></div>

<p>This creates a .python-version file, which contains the project's Python version.</p>

<p>When you then create a venv, this version will automatically be used.</p>
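<p>Once inside such a venv, a quick standard-library check confirms which interpreter is actually running (the paths shown will of course differ per machine):</p>

```python
import sys

# Location of the running interpreter: inside a venv this points
# into the venv directory rather than the system installation.
print(sys.executable)

# Version of the running interpreter, e.g. "3.10.2" if the venv
# was created with the version pinned in .python-version.
print("%d.%d.%d" % sys.version_info[:3])
```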

<h2 id="cest-quoi-ce-tas-de-trucs-incomprehensibles-avec-pip-easy_install-poetry-conda-virtualenv-pdm-">What is this incomprehensible pile of tools: pip, easy_install, poetry, conda, virtualenv, pdm?</h2>

<ul>
  <li><strong>pip</strong>: Python's standard package manager</li>
  <li><strong>easy_install</strong>: an old package installation tool from another era, forget about it</li>
  <li><strong>poetry</strong>: a more modern package manager that fixes quite a few of pip's problems, notably the handling of package conflicts within a venv, as well as the distinction between primary and secondary dependencies (see below).</li>
  <li><strong>conda</strong>: a package and environment manager for Python, popular in the scientific community. I have run into several Windows/Linux export issues when reproducing conda environments; it does not really follow Python standards and uses its own virtual-environment mechanism. It can be fine in a scientific context, but not in the industrial settings I have encountered.</li>
  <li><strong>virtualenv</strong>: the third-party ancestor of the standard venv module; for most use cases the built-in venv is enough nowadays</li>
  <li><strong>pdm</strong>: like poetry, but better: it follows Python standards and gives you more control over building and publishing wheels, in particular.</li>
</ul>

<h2 id="pourquoi-des-fois-je-vois-des-setuppy-et-parfois-des-pyprojecttoml--pourquoi-des-lockfiles-pdmlock-ou-poetrylock-">Why do I sometimes see setup.py files, and sometimes pyproject.toml? Why pdm.lock or poetry.lock lockfiles?</h2>

<ul>
  <li>setup.py: lets you install the current project as a package (pip used it when you ran pip install &lt;ma_lib&gt;). It is the old way of doing things, to be forgotten</li>
  <li>pyproject.toml: a standard since 2016 that replaces setup.py and serves as the project's central configuration point, i.e. what the primary and dev dependencies are, how to build the project, how to distribute it, what its metadata is, etc.</li>
</ul>

<p>When you run “pip install pandas” and then list the dependencies, we saw above that you get a list of all installed libraries, i.e. both the primary and the secondary dependencies, to the point where you can get completely lost. With the pyproject.toml, you get a list of the primary dependencies only:</p>

<p><img src="/assets/images/venv_python/image%202.png" alt="" style="display:block; margin-left:auto; margin-right:auto" /></p>
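<p>You can also inspect this difference programmatically with the standard library: a package's declared (primary) dependencies versus everything installed in the environment (what pip freeze lists). This is a sketch; the pandas lookup assumes pandas is installed and returns an empty list otherwise:</p>

```python
from importlib.metadata import PackageNotFoundError, distributions, requires

def primary_deps(package: str) -> list[str]:
    # Direct dependencies declared in the package's own metadata.
    try:
        return requires(package) or []
    except PackageNotFoundError:
        return []

# What `pip freeze` shows: every installed distribution,
# primary and transitive dependencies mixed together.
all_installed = sorted(d.metadata["Name"] for d in distributions())

print(primary_deps("pandas"))
print(all_installed)
```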

<p>The pdm.lock or poetry.lock file contains the secondary dependencies, i.e. the dependencies of your dependencies. They are what makes it possible to recreate the venv identically elsewhere:</p>

<p><img src="/assets/images/venv_python/image%203.png" alt="" style="display:block; margin-left:auto; margin-right:auto" /></p>

<h2 id="jai-mon-venv-qui-marche-bien-avec-pleins-de-belles-librairies-pour-la-data-science--je-fais-un-pip-install-dune-nouvelle-lib-et-paf-tout-est-cassé-ça-marche-pas--">My venv works fine, with lots of nice data science libraries; I pip install a new lib and bam, everything is broken?</h2>

<p>There are actually two cases:</p>

<ul>
  <li>It breaks your env because there is a fundamental incompatibility, and it makes a mess without telling you why, which is ugly</li>
</ul>

<p><img src="/assets/images/venv_python/image%204.png" alt="" style="display:block; margin-left:auto; margin-right:auto" /></p>

<p><img src="/assets/images/venv_python/image%205.png" alt="" style="display:block; margin-left:auto; margin-right:auto" /></p>

<p>Here we first install seaborn, then numpy in an incompatible version. Seaborn no longer works: we do get an error message, but there is no rollback, and the venv is left in a non-functional state for our code.</p>

<ul>
  <li>It silently upgrades a previously installed lib that we had pinned:</li>
</ul>

<p>For example, we install numpy at version 1.1.15, write our code, then later install seaborn. At that point numpy is automatically upgraded to a more recent version (without asking our opinion), and may therefore no longer be compatible with our old code.</p>
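<p>A pragmatic safeguard against such silent upgrades is to fail fast at startup by checking the installed version. A minimal standard-library sketch; the package names and version prefixes here are illustrative:</p>

```python
from importlib.metadata import PackageNotFoundError, version

def version_matches(package: str, expected_prefix: str) -> bool:
    # True if the installed version starts with the expected prefix,
    # e.g. "1." to accept any 1.x release of the package.
    try:
        return version(package).startswith(expected_prefix)
    except PackageNotFoundError:
        return False

# At the top of a script that depends on a pinned numpy, one could write:
# assert version_matches("numpy", "1."), "numpy was upgraded, aborting"
```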

<p>Both cases are in fact examples of “dependency hell”:</p>

<p><img src="/assets/images/venv_python/dependency_hell.png" alt="" style="display:block; margin-left:auto; margin-right:auto" /></p>

<p>This is a simple case, but imagine the complexity with fifth-level dependencies and dozens of libraries…</p>

<p>To better handle these problems, which can quickly cost you years of your life, I recommend using pdm (<a href="https://pdm-project.org/en/latest/">https://pdm-project.org/en/latest/</a>), for several reasons:</p>

<ul>
  <li>Rollback works: if an install fails, the venv is returned to its previous, functional state</li>
  <li>When you try to install a library that conflicts, you get a clear message guiding you toward a solution</li>
  <li>Primary and secondary dependencies are clearly separated</li>
  <li>PEP standards are followed strictly (which is not the case with poetry, for example)</li>
  <li>You can choose how to build the wheels that distribute your program, unlike poetry, which forces its own build tool (this has already been a blocker for me)</li>
  <li>It is quite efficient</li>
</ul>

<h2 id="je-veux-juste-un-truc-simple-qui-me-permette-de-dev-dans-des-notebooks-de-maniere-pragmatique-comment-je-fais-">I JUST WANT SOMETHING SIMPLE THAT LETS ME WORK IN NOTEBOOKS PRAGMATICALLY, HOW DO I DO IT?</h2>

<p>If you are on Windows, install WSL as administrator (forget PowerShell):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wsl <span class="nt">--install</span>
</code></pre></div></div>

<p>Install pdm:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt <span class="nb">install </span>python3.10-venv
curl <span class="nt">-sSL</span> https://pdm-project.org/install-pdm.py | python3 -
<span class="nb">echo export </span><span class="nv">PATH</span><span class="o">=</span>/home/romain/.local/bin:<span class="nv">$PATH</span> <span class="o">&gt;&gt;</span> ~/.bashrc
<span class="nb">source</span> ~/.bashrc
</code></pre></div></div>

<p>Install pyenv and choose the Python version:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl https://pyenv.run | bash
<span class="nb">echo</span> <span class="s1">'export PYENV_ROOT="$HOME/.pyenv"'</span> <span class="o">&gt;&gt;</span> ~/.bashrc
<span class="nb">echo</span> <span class="s1">'command -v pyenv &gt;/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"'</span> <span class="o">&gt;&gt;</span> ~/.bashrc
<span class="nb">echo</span> <span class="s1">'eval "$(pyenv init -)"'</span> <span class="o">&gt;&gt;</span> ~/.bashrc

<span class="nb">sudo </span>apt update
<span class="nb">sudo </span>apt <span class="nb">install</span> <span class="nt">-y</span> make build-essential libssl-dev zlib1g-dev <span class="se">\</span>
libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm <span class="se">\</span>
libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev

pyenv <span class="nb">install </span>3.11.2
pyenv <span class="nb">local </span>3.11.2
</code></pre></div></div>

<p>Create the working directory:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkdir </span>my_project
<span class="nb">cd </span>my_project
</code></pre></div></div>

<p>Create the virtual environment (answer the prompts to initialize the project):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pdm init
</code></pre></div></div>

<p>Add libraries:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pdm add pandas
</code></pre></div></div>

<p>Add ipykernel so the venv can be used directly from a notebook:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pdm add ipykernel
</code></pre></div></div>

<p>Note: pdm creates the venv in .venv by default; run “ls -a” to see it.</p>

<p>You can then select the virtual environment directly in the notebook, in VS Code for example.</p>]]></content><author><name>Romain Mathonat</name><email>romain.mathonat@gmail.com</email></author><summary type="html"><![CDATA[If you have ever run into any of the following problems, this guide will help you (and may even save you):]]></summary></entry><entry><title type="html">VS Code python</title><link href="http://vulgairedev.fr/2023/10/26/vscode.html" rel="alternate" type="text/html" title="VS Code python" /><published>2023-10-26T00:00:00+00:00</published><updated>2023-10-26T00:00:00+00:00</updated><id>http://vulgairedev.fr/2023/10/26/vscode</id><content type="html" xml:base="http://vulgairedev.fr/2023/10/26/vscode.html"><![CDATA[<h2 id="extensions">Extensions</h2>
<ul>
  <li>Python (which should install Pylance for static type checking)</li>
  <li>Ruff, a single tool for linting, Black-style formatting, isort, etc.</li>
  <li>Jupyter (Keymap, Slide show, cell tags)</li>
  <li>Coverage Gutters for code coverage</li>
  <li>Remote SSH/Explorer to connect to remote servers easily (no need for JupyterHub now)</li>
  <li>GitLens to supercharge git</li>
  <li>Vim</li>
  <li>Material Icon Theme</li>
  <li>Theme (publisher: “Mhammed Talhaouy”), personal preference</li>
</ul>

<h2 id="configuration-file">Configuration file</h2>
<p>Here is my config file, useful for handling copy-paste in Vim, or auto-running Ruff on save, for example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
	"editor.minimap.enabled": false,  
	// Bracket-pair colorization  
	"editor.bracketPairColorization.enabled": false,  
	"notebook.diff.ignoreMetadata": true,  
	"gitlens.hovers.currentLine.over": "line",  
	"workbench.iconTheme": "material-icon-theme",  
	"files.autoSave": "afterDelay",  
	"git.confirmSync": false,  
	"jupyter.notebookFileRoot": "${workspaceFolder}",  
	"vim.commandLineModeKeyBindingsNonRecursive": [],  
	"vim.useSystemClipboard": true,  
	"vim.handleKeys": {  
		"&lt;C-c&gt;": false,  
		"&lt;C-v&gt;": false,
		"&lt;C-j&gt;": false  
	},  
	"vim.visualModeKeyBindingsNonRecursive": [  
	{  
		"before": [  
			"p",  
		],  
		"after": [  
			"p",  
			"g",  
			"v",  
			"y"  
		]  
	}  
	],  
	"[python]": {  
		"editor.formatOnSave": true,  
		"editor.codeActionsOnSave": {  
		"source.organizeImports": true,  
		"source.fixAll": true  
		},  
		"editor.formatOnType": true,
 		"editor.defaultFormatter": "charliermarsh.ruff"
	},  
	"python.analysis.typeCheckingMode": "basic",  
	"python.analysis.autoImportCompletions": true,  
	"python.analysis.stubPath": "",  
	"python.analysis.indexing": true,  
	"python.terminal.activateEnvironment": false,  
	"workbench.colorTheme": "Theme",  
	"gitlens.views.branches.branches.layout": "list",  
	"explorer.confirmDragAndDrop": false,  
	"files.exclude": {  
	"**/__pycache__": true,  
	"**/.pytest_cache": true  
	},  
	"python.analysis.inlayHints.pytestParameters": true,  
	"pythonTestExplorer.testFramework": "pytest",  
	"python.analysis.inlayHints.functionReturnTypes": false,  
}  
</code></pre></div></div>

<h2 id="testing">Testing</h2>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>poetry add pytest pytest-cov
</code></pre></div></div>
<p>The first command produces a coverage report readable directly in VS Code (via Coverage Gutters), the second one a report in the terminal:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pytest <span class="nb">.</span> <span class="nt">--cov-report</span> xml:cov.xml <span class="nt">--cov</span> <span class="nb">.</span>
pytest <span class="nb">.</span> <span class="nt">--cov-report</span> term <span class="nt">--cov</span> <span class="nb">.</span>
</code></pre></div></div>
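<p>For reference, a minimal test file that the commands above would collect (a hypothetical example, not from any particular project):</p>

```python
# test_example.py — pytest collects files and functions prefixed with "test"
def add(a: int, b: int) -> int:
    return a + b

def test_add() -> None:
    assert add(2, 3) == 5
    assert add(-1, 1) == 0
```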

<h2 id="profiling">Profiling</h2>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>poetry add py-spy
</code></pre></div></div>
<p>You can then run py-spy to sample the running process and get a nice interactive visualization (here in speedscope format):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>py-spy record <span class="nt">--pid</span> 1400174 <span class="nt">--format</span> speedscope <span class="nt">-r</span> 1000
</code></pre></div></div>]]></content><author><name>Romain Mathonat</name><email>romain.mathonat@gmail.com</email></author><summary type="html"><![CDATA[Extensions Python (which should install Pylance for static type checking) Ruff, one tool for linting, black formatting, isort etc Jupyter (Keymap, Slide show, cell tags) Coverage gutters for code coverage Remote ssh/explorer to connect to distant servers easily (no need for jupyter hub now) GitLens to supercharge git Vim Material icon theme Theme (publisher:”Mhammed Talhaouy”), personal preference]]></summary></entry><entry><title type="html">Git Cheatsheet</title><link href="http://vulgairedev.fr/2023/10/16/git-cheatsheet.html" rel="alternate" type="text/html" title="Git Cheatsheet" /><published>2023-10-16T00:00:00+00:00</published><updated>2023-10-16T00:00:00+00:00</updated><id>http://vulgairedev.fr/2023/10/16/git-cheatsheet</id><content type="html" xml:base="http://vulgairedev.fr/2023/10/16/git-cheatsheet.html"><![CDATA[<h2 id="local-branch-creation">Local branch creation</h2>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git checkout main
git pull
git checkout <span class="nt">-b</span> illustration_workflow
</code></pre></div></div>
<h2 id="commit">Commit</h2>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git add compute_sessions.py
git commit <span class="nt">-m</span> <span class="s2">"illustration du workflow"</span>
</code></pre></div></div>

<h2 id="revert">Revert</h2>
<p>Add a commit that reverts the previous changes (that way we can create a new tag to re-deploy a previous version)</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git revert HEAD
</code></pre></div></div>

<h2 id="rebase-on-main">Rebase on main</h2>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git checkout main
git pull
git checkout illustration_workflow
git rebase main
</code></pre></div></div>
<p>We can also use git fomo (see aliases below).
If there are conflicts, resolve them manually, then:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git push -f
</code></pre></div></div>

<h2 id="push">Push</h2>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git push <span class="nt">-u</span> <span class="nt">--force-with-lease</span> origin illustration_workflow
</code></pre></div></div>
<p>We can configure git to push the current branch directly without having to specify those arguments (a simple “git push”):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git config <span class="nt">--add</span> <span class="nt">--bool</span> push.autoSetupRemote <span class="nb">true</span>
</code></pre></div></div>
<p>or for older git versions:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git config <span class="nt">--global</span> push.default current
</code></pre></div></div>

<h2 id="reseting">Resetting</h2>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git reflog
git reset HEAD@<span class="o">{</span>index<span class="o">}</span>
</code></pre></div></div>
<p>git reset --hard HEAD permanently deletes the modifications made after HEAD</p>

<h2 id="add-a-small-fix-to-last-commit">Add a small fix to last commit</h2>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git commit <span class="nt">-a</span> <span class="nt">--amend</span> <span class="nt">--no-edit</span>
</code></pre></div></div>
<p><strong>Warning</strong>: only on local commits that have not been pushed!
To change only the commit message:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git commit <span class="nt">--amend</span>
</code></pre></div></div>

<h2 id="move-the-last-commit-from-main-to-another-branch">Move the last commit from main to another branch</h2>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git checkout new_branch
git cherry-pick master
git checkout master
git reset HEAD~ <span class="nt">--hard</span>
</code></pre></div></div>

<h2 id="aliases">Aliases</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[alias]
     fomo = !git fetch origin main &amp;&amp; git rebase origin/main
     ci = commit
     co = checkout
     st = status -sb
     sts = status -s
     br = branch
     tip = log -n 1 --abbrev-commit --decorate
     lol = log --graph --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %s %Cgreen(%cr) %C(bold blue)&lt;%an&gt;%Creset' --abbrev-commit
     lola = log --graph --decorate --pretty=oneline --abbrev-commit --all
     unstage = reset HEAD
     cp = cherry-pick
     cam = commit -am
     last = log -1 --stat
     cl = clone
     dc = diff --cached
     lg = log --graph --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %s %Cgreen(%cr) %Cblue&lt;%an&gt;%Creset' --abbrev-commit --date=relative --all
     dt = diff-tree --no-commit-id --name-only -r
     pushf = push --force-with-lease
     last = log -1 --stat
     oups = commit --amend --no-edit
     unadd = reset HEAD
     nvm = reset --hard HEAD
     hop = "!f() { git checkout -b $1 &amp;&amp; git checkout main &amp;&amp; git reset HEAD~ --hard &amp;&amp; git checkout $1;}; f"
</code></pre></div></div>]]></content><author><name>Romain Mathonat</name><email>romain.mathonat@gmail.com</email></author><summary type="html"><![CDATA[Local branch creation git checkout main git pull git checkout -b illustration_workflow Commit git add compute_sessions.py git commit -m "illustration du workflow"]]></summary></entry><entry><title type="html">FastAPI et exposition de services IA</title><link href="http://vulgairedev.fr/2022/12/05/fastAPI-IA.html" rel="alternate" type="text/html" title="FastAPI et exposition de services IA" /><published>2022-12-05T00:00:00+00:00</published><updated>2022-12-05T00:00:00+00:00</updated><id>http://vulgairedev.fr/2022/12/05/fastAPI-IA</id><content type="html" xml:base="http://vulgairedev.fr/2022/12/05/fastAPI-IA.html"><![CDATA[<h1 id="intro">Intro</h1>
<p>9 out of 10 “data science” projects <a href="https://towardsdatascience.com/why-90-percent-of-all-machine-learning-models-never-make-it-into-production-ce7e250d5a4a">never make it to production</a>.<br />
One of the reasons is the difficulty, as well as the lack of standards, in going from a notebook to a genuinely useful, working product.  <br />
In this tutorial we will see, through a simple case, how to use FastAPI to build an API exposing AI services, which can then be queried over HTTP from any other software component. <br />
More precisely, we will fetch usage data from a piece of software deployed on a fleet of machines, stored in Elasticsearch, and refine it to extract sessions (clustering along the time axis only).</p>

<h1 id="quest-ce-que-fastapi-">What is FastAPI?</h1>
<p>According to the <a href="https://fastapi.tiangolo.com/">very good official documentation</a>, FastAPI is “a modern, fast web framework for building APIs with Python 3.7+ based on standard Python type hints”.<br />
It is fast (comparable to Go and NodeJS), lets you develop quickly, is simple, and provides several quite handy tools.</p>

<h1 id="pourquoi-fastapi-plutot-que-dautres-web-servers-">Why FastAPI rather than other web servers?</h1>

<ul>
  <li>much lighter than <a href="https://www.django-rest-framework.org/">Django Rest Framework</a></li>
  <li>more performant than <a href="https://flask.palletsprojects.com/en/2.2.x/">Flask</a>; leverages typing for documentation and automatic validation of input data via <a href="https://pydantic-docs.helpmanual.io/">pydantic</a>.</li>
  <li>adds handy utilities on top of <a href="https://www.starlette.io/">starlette</a>.</li>
</ul>

<h1 id="création-de-lenvironnement-virtuel">Creating the virtual environment</h1>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>poetry init 
poetry add pandas fastapi[all] elasticsearch[async]<span class="o">==</span>7.13 requests pyYAML
</code></pre></div></div>

<p>The project tree is then:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tp_fast_api/   
├── .venv/
├── poetry.lock
└── pyproject.toml
</code></pre></div></div>
<h1 id="squelette-de-base">Basic skeleton</h1>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">fastapi</span> <span class="kn">import</span> <span class="n">FastAPI</span>

<span class="n">app</span> <span class="o">=</span> <span class="n">FastAPI</span><span class="p">(</span><span class="n">debug</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="o">@</span><span class="n">app</span><span class="p">.</span><span class="n">post</span><span class="p">(</span><span class="s">"/get_ecrans/"</span><span class="p">)</span> <span class="c1"># we use a POST to simplify sending data: directly in the body
</span><span class="k">def</span> <span class="nf">get_ecrans</span><span class="p">():</span> 
    <span class="k">return</span> <span class="p">{</span><span class="s">"ecrans"</span><span class="p">:</span> <span class="p">[</span><span class="s">"ecran1"</span><span class="p">,</span> <span class="s">"ecran2"</span><span class="p">]}</span>
</code></pre></div></div>

<p>Start the web server with uvicorn, after activating the virtual environment:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>poetry shell
uvicorn tp_fast_api.main:app <span class="nt">--reload</span>
</code></pre></div></div>
<p>Then go to http://127.0.0.1:8000/docs</p>

<h1 id="ajout-des-parametres">Adding parameters</h1>
<p>We will now add parameters: the users to query on, a min date and a max date:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># main.py
</span><span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span>

<span class="kn">from</span> <span class="nn">fastapi</span> <span class="kn">import</span> <span class="n">FastAPI</span>
<span class="kn">from</span> <span class="nn">pydantic</span> <span class="kn">import</span> <span class="n">BaseModel</span>


<span class="k">class</span> <span class="nc">Users</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
    <span class="n">utils</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span>
    <span class="n">date_min</span><span class="p">:</span> <span class="n">datetime</span>
    <span class="n">date_max</span><span class="p">:</span> <span class="n">datetime</span>


<span class="n">app</span> <span class="o">=</span> <span class="n">FastAPI</span><span class="p">(</span><span class="n">debug</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>


<span class="o">@</span><span class="n">app</span><span class="p">.</span><span class="n">post</span><span class="p">(</span><span class="s">"/get_ecrans/"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">get_ecrans</span><span class="p">(</span><span class="n">users</span><span class="p">:</span> <span class="n">Users</span><span class="p">):</span>
    <span class="k">return</span> <span class="p">{</span><span class="s">"ecrans"</span><span class="p">:</span> <span class="p">[</span><span class="s">"ecran1"</span><span class="p">,</span> <span class="s">"ecran2"</span><span class="p">],</span> <span class="s">"users"</span><span class="p">:</span> <span class="n">users</span><span class="p">.</span><span class="n">utils</span><span class="p">}</span>
</code></pre></div></div>

<p>For this we used <a href="https://pydantic-docs.helpmanual.io/">pydantic</a>.<br />
Try sending the following body:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
 "utils": ["romain"],
 "date_min": "2022-09-22 11:00Z",
 "date_max": "2022-09-28T12:00+02:00"
}
</code></pre></div></div>
<p>We get a 200:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"ecrans"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="s2">"ecran1"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"ecran2"</span><span class="w">
  </span><span class="p">],</span><span class="w">
  </span><span class="nl">"users"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"utils"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
      </span><span class="s2">"romain"</span><span class="w">
    </span><span class="p">],</span><span class="w">
    </span><span class="nl">"date_min"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2022-09-22T11:00:00+00:00"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"date_max"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2022-09-28T12:00:00+02:00"</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>Notice that the dates were automatically parsed into the right type (even though two different formats were sent).<br />
For the list of datetime formats supported by pydantic: https://pydantic-docs.helpmanual.io/usage/types/#datetime-types</p>
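<p>As an aside, part of this normalization can be approximated with the standard library alone. This is a rough sketch of what pydantic does for us, not its actual implementation; it only handles the trailing "Z" case on top of the standard parser:</p>

```python
from datetime import datetime


def parse_iso(value):
    # Rough sketch of part of what pydantic does: accept a trailing "Z"
    # as UTC, then delegate to the standard fromisoformat().
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


# The two formats sent in the body above normalize to the same ISO shape:
d1 = parse_iso("2022-09-22 11:00Z")
d2 = parse_iso("2022-09-28T12:00+02:00")
```

Both results are timezone-aware datetimes, matching the 200 response shown above.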

<p>Now try sending a date in an unknown format, with “utils” misspelled:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
 "util": ["romain"],
 "date_min": "2022-09-22 11:00Z",
 "date_max": "2022-09-28 / 12:00+02:00"
}
</code></pre></div></div>
<p>We get a 422:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"detail"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"loc"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="s2">"body"</span><span class="p">,</span><span class="w">
        </span><span class="s2">"utils"</span><span class="w">
      </span><span class="p">],</span><span class="w">
      </span><span class="nl">"msg"</span><span class="p">:</span><span class="w"> </span><span class="s2">"field required"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"value_error.missing"</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"loc"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="s2">"body"</span><span class="p">,</span><span class="w">
        </span><span class="s2">"date_max"</span><span class="w">
      </span><span class="p">],</span><span class="w">
      </span><span class="nl">"msg"</span><span class="p">:</span><span class="w"> </span><span class="s2">"invalid datetime format"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"value_error.datetime"</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>With only these few extra lines, we get:</p>
<ul>
  <li>parsing of the request body</li>
  <li>type conversion and validation</li>
  <li>meaningful error messages when something goes wrong</li>
  <li>IDE autocompletion support for the input body we manipulate</li>
  <li>automatic OpenAPI documentation for the user</li>
</ul>
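<p>For contrast, here is a hypothetical hand-rolled version of the validation the <code>Users</code> model gives us for free. The <code>validate_body</code> helper and its error shape are illustrative only, loosely mimicking the 422 payload above:</p>

```python
from datetime import datetime


def validate_body(body):
    # Hypothetical hand-rolled validation, loosely mimicking FastAPI's
    # 422 error entries (a "loc" path and a "msg" per problem).
    errors = []
    if "utils" not in body:
        errors.append({"loc": ["body", "utils"], "msg": "field required"})
    for field in ("date_min", "date_max"):
        try:
            datetime.fromisoformat(str(body.get(field, "")))
        except ValueError:
            errors.append({"loc": ["body", field], "msg": "invalid datetime format"})
    return errors


# Same misspelled body as above, with dates in fromisoformat-friendly form
bad_body = {"util": ["romain"],
            "date_min": "2022-09-22 11:00+00:00",
            "date_max": "2022-09-28 / 12:00+02:00"}
errors = validate_body(bad_body)
```

Two errors come back, one per problem, which is exactly the bookkeeping pydantic spares us.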

<p>The project tree is then:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tp_fast_api/   
├── .venv/
├── poetry.lock
├── pyproject.toml
└── tp_fast_api/
    └── main.py
</code></pre></div></div>

<h1 id="requetage-sur-elastic">Querying Elasticsearch</h1>
<p>We create a config/ folder at the root, which will hold the credentials (in YAML). All tokens and keys that must not be committed go there.
Here it notably holds the Elasticsearch access tokens.</p>

<div class="language-yml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># credentials.yml</span>
<span class="na">ES_PROD_ID</span><span class="pi">:</span> <span class="s2">"</span><span class="s">&lt;id&gt;"</span>
<span class="na">ES_PROD_API_KEY</span><span class="pi">:</span> <span class="s2">"</span><span class="s">&lt;api_key&gt;"</span>
</code></pre></div></div>
<p>We then add a settings.py file exposing the project’s global variables, which takes care of creating the Elasticsearch connection:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># settings.py
</span><span class="kn">import</span> <span class="nn">pathlib</span>

<span class="kn">import</span> <span class="nn">yaml</span>
<span class="kn">from</span> <span class="nn">elasticsearch</span> <span class="kn">import</span> <span class="n">Elasticsearch</span>

<span class="n">CREDENTIAL_PATH</span> <span class="o">=</span> <span class="n">pathlib</span><span class="p">.</span><span class="n">Path</span><span class="p">(</span><span class="n">__file__</span><span class="p">).</span><span class="n">parent</span><span class="p">.</span><span class="n">parent</span> <span class="o">/</span> <span class="s">"config"</span> <span class="o">/</span> <span class="s">"credentials.yml"</span>

<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">CREDENTIAL_PATH</span><span class="p">,</span> <span class="s">"r"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="n">credentials</span> <span class="o">=</span> <span class="n">yaml</span><span class="p">.</span><span class="n">safe_load</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>

<span class="n">URL_ELASTIC_PROD</span> <span class="o">=</span> <span class="s">"&lt;url_elastic&gt;:9200"</span>

<span class="n">ES_PROD</span> <span class="o">=</span> <span class="n">Elasticsearch</span><span class="p">(</span>
    <span class="n">hosts</span><span class="o">=</span><span class="p">[</span><span class="n">URL_ELASTIC_PROD</span><span class="p">],</span>
    <span class="n">request_timeout</span><span class="o">=</span><span class="mi">30</span><span class="p">,</span>
    <span class="n">api_key</span><span class="o">=</span><span class="p">(</span><span class="n">credentials</span><span class="p">[</span><span class="s">"ES_PROD_ID"</span><span class="p">],</span> <span class="n">credentials</span><span class="p">[</span><span class="s">"ES_PROD_API_KEY"</span><span class="p">]),</span>
<span class="p">)</span>
</code></pre></div></div>
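<p>Note that a module-level client like <code>ES_PROD</code> is built at import time, so importing settings requires the credentials file to be present. A common alternative (an assumption on my part, not what this project does) is a lazily cached factory, sketched here with a stand-in dict instead of a real <code>Elasticsearch</code> client:</p>

```python
from functools import lru_cache

calls = 0


@lru_cache(maxsize=1)
def get_client():
    # Stand-in for building the Elasticsearch client: doing it lazily
    # means nothing connects (or reads credentials) until first use.
    global calls
    calls += 1
    return {"host": "elastic-url:9200"}  # hypothetical placeholder host


a = get_client()
b = get_client()
```

The factory body runs once; every caller then shares the same client object, just like a module-level singleton.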

<p>We now create a function that fetches the data directly from Elasticsearch, in a separate module, data_collect.py:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># data_collect.py
</span><span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span>

<span class="kn">from</span> <span class="nn">elasticsearch</span> <span class="kn">import</span> <span class="n">helpers</span>

<span class="kn">import</span> <span class="nn">tp_fast_api.settings</span> <span class="k">as</span> <span class="n">settings</span>


<span class="k">def</span> <span class="nf">extract_ecrans</span><span class="p">(</span><span class="n">utils</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span> <span class="n">date_min</span><span class="p">:</span> <span class="n">datetime</span><span class="p">,</span> <span class="n">date_max</span><span class="p">:</span> <span class="n">datetime</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]]:</span>
    <span class="n">query</span> <span class="o">=</span> <span class="p">{</span>
        <span class="s">"query"</span><span class="p">:</span> <span class="p">{</span>
            <span class="s">"bool"</span><span class="p">:</span> <span class="p">{</span>
                <span class="s">"must"</span><span class="p">:</span> <span class="p">[{</span><span class="s">"terms"</span><span class="p">:</span> <span class="p">{</span><span class="s">"util"</span><span class="p">:</span> <span class="n">utils</span><span class="p">}}],</span>
                <span class="s">"filter"</span><span class="p">:</span> <span class="p">[</span>
                    <span class="p">{</span>
                        <span class="s">"range"</span><span class="p">:</span> <span class="p">{</span>
                            <span class="s">"@timestamp"</span><span class="p">:</span> <span class="p">{</span>
                                <span class="s">"gte"</span><span class="p">:</span> <span class="n">date_min</span><span class="p">.</span><span class="n">isoformat</span><span class="p">(),</span>
                                <span class="s">"lte"</span><span class="p">:</span> <span class="n">date_max</span><span class="p">.</span><span class="n">isoformat</span><span class="p">(),</span>
                                <span class="s">"format"</span><span class="p">:</span> <span class="s">"strict_date_optional_time"</span><span class="p">,</span>
                            <span class="p">}</span>
                        <span class="p">}</span>
                    <span class="p">}</span>
                <span class="p">],</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>

    <span class="n">data</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">helpers</span><span class="p">.</span><span class="n">scan</span><span class="p">(</span><span class="n">settings</span><span class="p">.</span><span class="n">ES_PROD</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="s">"&lt;my_index&gt;"</span><span class="p">,</span> <span class="n">query</span><span class="o">=</span><span class="n">query</span><span class="p">):</span>
        <span class="n">doc</span> <span class="o">=</span> <span class="n">doc</span><span class="p">[</span><span class="s">"_source"</span><span class="p">]</span>

        <span class="n">elt</span> <span class="o">=</span> <span class="p">[</span>
            <span class="n">doc</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"@timestamp"</span><span class="p">),</span>
            <span class="n">doc</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"util"</span><span class="p">),</span>
            <span class="n">doc</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"ecran"</span><span class="p">),</span>
            <span class="n">doc</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"instance"</span><span class="p">)</span>
        <span class="p">]</span>
        <span class="n">data</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">elt</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">data</span>
</code></pre></div></div>
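<p>The flattening loop above can be exercised on its own, against a hand-built hit in the shape returned by <code>helpers.scan</code>. The <code>flatten_hits</code> helper is illustrative, not part of the project:</p>

```python
def flatten_hits(hits):
    # Same row shape as extract_ecrans: one [timestamp, util, ecran,
    # instance] list per hit; dict.get() yields None for missing fields.
    rows = []
    for hit in hits:
        src = hit["_source"]
        rows.append([src.get("@timestamp"), src.get("util"),
                     src.get("ecran"), src.get("instance")])
    return rows


hits = [{"_source": {"@timestamp": 1663849892657, "util": "romain",
                     "ecran": "connexion", "instance": "instance589"}}]
rows = flatten_hits(hits)
```

Using <code>.get()</code> rather than indexing means a document missing a field produces a <code>None</code> cell instead of a crash.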

<p>Then we update the main to return these screens:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># main.py
</span><span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span>

<span class="kn">from</span> <span class="nn">fastapi</span> <span class="kn">import</span> <span class="n">FastAPI</span>
<span class="kn">from</span> <span class="nn">pydantic</span> <span class="kn">import</span> <span class="n">BaseModel</span>

<span class="kn">from</span> <span class="nn">tp_fast_api.data_collect</span> <span class="kn">import</span> <span class="n">extract_ecrans</span>


<span class="k">class</span> <span class="nc">Users</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
    <span class="n">utils</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span>
    <span class="n">date_min</span><span class="p">:</span> <span class="n">datetime</span>
    <span class="n">date_max</span><span class="p">:</span> <span class="n">datetime</span>


<span class="n">app</span> <span class="o">=</span> <span class="n">FastAPI</span><span class="p">(</span><span class="n">debug</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>


<span class="o">@</span><span class="n">app</span><span class="p">.</span><span class="n">post</span><span class="p">(</span><span class="s">"/get_ecrans/"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">get_ecrans</span><span class="p">(</span><span class="n">users</span><span class="p">:</span> <span class="n">Users</span><span class="p">):</span>
    <span class="n">ecrans</span> <span class="o">=</span> <span class="n">extract_ecrans</span><span class="p">(</span><span class="n">users</span><span class="p">.</span><span class="n">utils</span><span class="p">,</span> <span class="n">users</span><span class="p">.</span><span class="n">date_min</span><span class="p">,</span> <span class="n">users</span><span class="p">.</span><span class="n">date_max</span><span class="p">)</span>
    <span class="k">return</span> <span class="p">{</span><span class="s">"users"</span><span class="p">:</span> <span class="n">users</span><span class="p">,</span> <span class="s">"ecrans"</span><span class="p">:</span> <span class="n">ecrans</span><span class="p">}</span>
</code></pre></div></div>

<p>We can then test sending a request, and confirm that we do receive a list of screens:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"ecrans"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">[</span><span class="w">
      </span><span class="mi">1663849892657</span><span class="p">,</span><span class="w">
      </span><span class="s2">"romain"</span><span class="p">,</span><span class="w">
      </span><span class="s2">"connexion"</span><span class="p">,</span><span class="w">
      </span><span class="s2">"instance589"</span><span class="w">
    </span><span class="p">],</span><span class="w">
    </span><span class="p">[</span><span class="err">...</span><span class="p">]</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p><strong>NB</strong>: the date typing happened automatically thanks to pydantic, sparing us the conversion handling, which can get painful.</p>

<p>The tree is then:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tp_fast_api
├── config
│   └── credentials.yml
├── poetry.lock        
├── pyproject.toml     
└── tp_fast_api
    ├── data_collect.py
    ├── main.py
    └── settings.py
</code></pre></div></div>

<h1 id="tests">Tests</h1>
<p>We will now test our function. For that, we need to intercept the Elasticsearch call and mock its result: the test environment does not necessarily have access to Elasticsearch, and we do not want to add potential load on the cluster with every request.</p>

<p>We add pytest:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>poetry add pytest
</code></pre></div></div>

<p>Then we add the unit test:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># test_data_collect.py
</span><span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span>
<span class="kn">from</span> <span class="nn">unittest.mock</span> <span class="kn">import</span> <span class="n">patch</span>

<span class="kn">from</span> <span class="nn">tp_fast_api.data_collect</span> <span class="kn">import</span> <span class="n">extract_ecrans</span>


<span class="o">@</span><span class="n">patch</span><span class="p">(</span><span class="s">"tp_fast_api.data_collect.helpers.scan"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">test_extract_ecrans</span><span class="p">(</span><span class="n">mock_scan</span><span class="p">):</span>
    <span class="n">mock_scan</span><span class="p">.</span><span class="n">return_value</span> <span class="o">=</span> <span class="p">[</span>
        <span class="p">{</span>
            <span class="s">"_index"</span><span class="p">:</span> <span class="s">"mon_index"</span><span class="p">,</span>
            <span class="s">"_type"</span><span class="p">:</span> <span class="s">"_doc"</span><span class="p">,</span>
            <span class="s">"_id"</span><span class="p">:</span> <span class="s">"1660119206971bdcad61d5b"</span><span class="p">,</span>
            <span class="s">"_score"</span><span class="p">:</span> <span class="mf">2.0</span><span class="p">,</span>
            <span class="s">"_source"</span><span class="p">:</span> <span class="p">{</span>
                <span class="s">"ecran"</span><span class="p">:</span> <span class="s">"connexion"</span><span class="p">,</span>
                <span class="s">"instance"</span><span class="p">:</span> <span class="s">"instance589"</span><span class="p">,</span>
                <span class="s">"@timestamp"</span><span class="p">:</span> <span class="mi">1660119206971</span><span class="p">,</span>
                <span class="s">"util"</span><span class="p">:</span> <span class="s">"romain"</span>
            <span class="p">},</span>
        <span class="p">}</span>
    <span class="p">]</span>

    <span class="c1"># the parameters here don't matter much since we intercept the request
</span>    <span class="n">results</span> <span class="o">=</span> <span class="n">extract_ecrans</span><span class="p">([</span><span class="s">"romain"</span><span class="p">],</span> <span class="n">datetime</span><span class="p">.</span><span class="n">now</span><span class="p">(),</span> <span class="n">datetime</span><span class="p">.</span><span class="n">now</span><span class="p">())</span>

    <span class="k">assert</span> <span class="nb">len</span><span class="p">(</span><span class="n">results</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span>
    <span class="k">assert</span> <span class="n">results</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span> <span class="o">==</span> <span class="s">"romain"</span>
</code></pre></div></div>
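<p>The subtlety of <code>@patch("tp_fast_api.data_collect.helpers.scan")</code> is that you patch the name where it is looked up, and only for the duration of the test. A minimal self-contained illustration with a hypothetical module standing in for data_collect (no pytest needed):</p>

```python
import types
from unittest.mock import patch

# Hypothetical module standing in for tp_fast_api.data_collect.
mod = types.ModuleType("fake_collect")
mod.fetch = lambda: "real data"


def use_fetch():
    # The attribute is resolved at call time, which is what lets
    # patch swap it out underneath the caller.
    return mod.fetch()


with patch.object(mod, "fetch", return_value="mocked data"):
    inside = use_fetch()   # patched: the mock answers
outside = use_fetch()      # the patch is undone when the context exits
```

The decorator form used in the test above does the same thing, scoped to the test function.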

<p>Project tree:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tp_fast_api
├── config
│   └── credentials.yml
├── tests
│   ├── __init__.py
│   └── test_data_collect.py
├── tp_fast_api
│   ├── __init__.py
│   ├── data_collect.py
│   ├── main.py
│   └── settings.py
├── poetry.lock
└── pyproject.toml
</code></pre></div></div>

<p>We run the tests, with pytest:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pytest <span class="nb">.</span>
</code></pre></div></div>
<p><img src="/assets/images/test100.png" alt="" style="display:block; margin-left:auto; margin-right:auto" /></p>

<h1 id="script-de-requetage">Query script</h1>
<p>We will write a script that queries our service from the outside, notably to run a small basic benchmark afterwards.
We fire 100 requests asynchronously, with the same parameters.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">asyncio</span>
<span class="kn">import</span> <span class="nn">datetime</span>
<span class="kn">import</span> <span class="nn">time</span>

<span class="kn">import</span> <span class="nn">httpx</span>


<span class="k">async</span> <span class="k">def</span> <span class="nf">query_api</span><span class="p">():</span>
    <span class="n">body</span> <span class="o">=</span> <span class="p">{</span>
        <span class="s">"date_max"</span><span class="p">:</span> <span class="n">datetime</span><span class="p">.</span><span class="n">datetime</span><span class="p">.</span><span class="n">now</span><span class="p">().</span><span class="n">isoformat</span><span class="p">(),</span>
        <span class="s">"date_min"</span><span class="p">:</span> <span class="p">(</span><span class="n">datetime</span><span class="p">.</span><span class="n">datetime</span><span class="p">.</span><span class="n">now</span><span class="p">()</span> <span class="o">-</span> <span class="n">datetime</span><span class="p">.</span><span class="n">timedelta</span><span class="p">(</span><span class="mi">7</span><span class="p">)).</span><span class="n">isoformat</span><span class="p">(),</span>
        <span class="s">"utils"</span><span class="p">:</span> <span class="p">[</span><span class="s">"util"</span><span class="p">],</span>
    <span class="p">}</span>
    <span class="k">async</span> <span class="k">with</span> <span class="n">httpx</span><span class="p">.</span><span class="n">AsyncClient</span><span class="p">()</span> <span class="k">as</span> <span class="n">client</span><span class="p">:</span>
        <span class="k">await</span> <span class="n">client</span><span class="p">.</span><span class="n">post</span><span class="p">(</span><span class="s">"http://127.0.0.1:8000/get_ecrans/"</span><span class="p">,</span> <span class="n">json</span><span class="o">=</span><span class="n">body</span><span class="p">,</span> <span class="n">timeout</span><span class="o">=</span><span class="bp">None</span><span class="p">)</span>


<span class="k">async</span> <span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">gather</span><span class="p">(</span><span class="o">*</span><span class="p">[</span><span class="n">query_api</span><span class="p">()</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">)])</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>

    <span class="n">asyncio</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">main</span><span class="p">())</span>

    <span class="k">print</span><span class="p">(</span><span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">start</span><span class="p">)</span>
</code></pre></div></div>

<p>Execution time: 16.18s<br />
Note: since the Elasticsearch query is always the same, Elasticsearch keeps the data in cache.</p>
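<p>To see why going asynchronous should help here, note that concurrent awaits overlap instead of queueing behind one another. A small stand-alone demonstration, using <code>asyncio.sleep</code> as a stand-in for the I/O-bound Elasticsearch call:</p>

```python
import asyncio
import time


async def fake_query():
    # Stand-in for an I/O-bound request such as the Elasticsearch call.
    await asyncio.sleep(0.05)


async def main():
    # 100 concurrent 50 ms waits overlap instead of running back to back,
    # so the total stays close to 50 ms rather than 5 s.
    await asyncio.gather(*[fake_query() for _ in range(100)])


start = time.time()
asyncio.run(main())
elapsed = time.time() - start
```

This only pays off when the handler actually awaits its I/O, which is what the next section sets up.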

<h1 id="passage-en-asynchrone">Going asynchronous</h1>
<p>To understand the async mechanism properly, see the excellent <a href="https://fastapi.tiangolo.com/async/">FastAPI docs</a><br />
We will now adapt our code to take advantage of async in Python, handled by FastAPI (and Starlette, under the hood).
We start by creating a new main_async.py:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># main_async.py
</span><span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span>

<span class="kn">from</span> <span class="nn">fastapi</span> <span class="kn">import</span> <span class="n">FastAPI</span>
<span class="kn">from</span> <span class="nn">pydantic</span> <span class="kn">import</span> <span class="n">BaseModel</span>

<span class="kn">from</span> <span class="nn">tp_fast_api.data_collect</span> <span class="kn">import</span> <span class="n">extract_ecrans_async</span>


<span class="k">class</span> <span class="nc">Users</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
    <span class="n">utils</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span>
    <span class="n">date_min</span><span class="p">:</span> <span class="n">datetime</span>
    <span class="n">date_max</span><span class="p">:</span> <span class="n">datetime</span>


<span class="n">app</span> <span class="o">=</span> <span class="n">FastAPI</span><span class="p">(</span><span class="n">debug</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>


<span class="o">@</span><span class="n">app</span><span class="p">.</span><span class="n">post</span><span class="p">(</span><span class="s">"/get_ecrans/"</span><span class="p">)</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">get_ecrans</span><span class="p">(</span><span class="n">users</span><span class="p">:</span> <span class="n">Users</span><span class="p">):</span>
    <span class="n">ecrans</span> <span class="o">=</span> <span class="k">await</span> <span class="n">extract_ecrans_async</span><span class="p">(</span><span class="n">users</span><span class="p">.</span><span class="n">utils</span><span class="p">,</span> <span class="n">users</span><span class="p">.</span><span class="n">date_min</span><span class="p">,</span> <span class="n">users</span><span class="p">.</span><span class="n">date_max</span><span class="p">)</span>
    <span class="k">return</span> <span class="p">{</span><span class="s">"ecrans"</span><span class="p">:</span> <span class="n">ecrans</span><span class="p">}</span>
</code></pre></div></div>

<p>We add an asynchronous Elasticsearch connection to settings.py:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># settings.py
</span><span class="kn">from</span> <span class="nn">elasticsearch</span> <span class="kn">import</span> <span class="n">Elasticsearch</span><span class="p">,</span> <span class="n">AsyncElasticsearch</span>

<span class="c1"># ...
</span>
<span class="n">ES_ASYNC</span> <span class="o">=</span> <span class="n">AsyncElasticsearch</span><span class="p">(</span>
    <span class="n">hosts</span><span class="o">=</span><span class="p">[</span><span class="n">URL_ELASTIC_PROD</span><span class="p">],</span>
    <span class="n">request_timeout</span><span class="o">=</span><span class="mi">30</span><span class="p">,</span>
    <span class="n">api_key</span><span class="o">=</span><span class="p">(</span><span class="n">credentials</span><span class="p">[</span><span class="s">"ES_PROD_ID"</span><span class="p">],</span> <span class="n">credentials</span><span class="p">[</span><span class="s">"ES_PROD_API_KEY"</span><span class="p">]),</span>
<span class="p">)</span>
</code></pre></div></div>

<p>We add an asynchronous screen-collection function to data_collect.py:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># data_collect.py
</span>
<span class="c1"># ...
</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">extract_ecrans_async</span><span class="p">(</span>
    <span class="n">utils</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span> <span class="n">date_min</span><span class="p">:</span> <span class="n">datetime</span><span class="p">,</span> <span class="n">date_max</span><span class="p">:</span> <span class="n">datetime</span>
<span class="p">):</span>
    <span class="n">query</span> <span class="o">=</span> <span class="p">{</span>
        <span class="s">"query"</span><span class="p">:</span> <span class="p">{</span>
            <span class="s">"bool"</span><span class="p">:</span> <span class="p">{</span>
                <span class="s">"must"</span><span class="p">:</span> <span class="p">[{</span><span class="s">"terms"</span><span class="p">:</span> <span class="p">{</span><span class="s">"util"</span><span class="p">:</span> <span class="n">utils</span><span class="p">}}],</span>
                <span class="s">"filter"</span><span class="p">:</span> <span class="p">[</span>
                    <span class="p">{</span>
                        <span class="s">"range"</span><span class="p">:</span> <span class="p">{</span>
                            <span class="s">"@timestamp"</span><span class="p">:</span> <span class="p">{</span>
                                <span class="s">"gte"</span><span class="p">:</span> <span class="n">date_min</span><span class="p">.</span><span class="n">isoformat</span><span class="p">(),</span>
                                <span class="s">"lte"</span><span class="p">:</span> <span class="n">date_max</span><span class="p">.</span><span class="n">isoformat</span><span class="p">(),</span>
                                <span class="s">"format"</span><span class="p">:</span> <span class="s">"strict_date_optional_time"</span><span class="p">,</span>
                            <span class="p">}</span>
                        <span class="p">}</span>
                    <span class="p">}</span>
                <span class="p">],</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>

    <span class="n">data</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">async</span> <span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">helpers</span><span class="p">.</span><span class="n">async_scan</span><span class="p">(</span>
        <span class="n">settings</span><span class="p">.</span><span class="n">ES_ASYNC</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="s">"mon_index"</span><span class="p">,</span> <span class="n">query</span><span class="o">=</span><span class="n">query</span>
    <span class="p">):</span>
        <span class="n">doc</span> <span class="o">=</span> <span class="n">doc</span><span class="p">[</span><span class="s">"_source"</span><span class="p">]</span>

        <span class="n">elt</span> <span class="o">=</span> <span class="p">[</span>
            <span class="n">doc</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"@timestamp"</span><span class="p">),</span>
            <span class="n">doc</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"util"</span><span class="p">),</span>
            <span class="n">doc</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"ecran"</span><span class="p">),</span>
            <span class="n">doc</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"instance"</span><span class="p">)</span>
        <span class="p">]</span>
        <span class="n">data</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">elt</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">data</span>
</code></pre></div></div>

<p>If we run our little benchmark script, we get: 11.13s<br />
That is a 1.5x speed-up in processing time. On paper we should in principle see more like 2-3x in normal use (it seems), with performance comparable to what a Node Express server can deliver, for example, and clearly better than Django or Flask.
For more information, see <a href="https://christophergs.com/tutorials/ultimate-fastapi-tutorial-pt-9-asynchronous-performance-basics/">here</a></p>
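<p>The benchmark script itself is not reproduced here; as a hypothetical toy model of it, the sketch below shows where the async speed-up comes from: ten requests that each spend 50 ms blocked on I/O, run sequentially and then concurrently (asyncio.sleep stands in for the Elasticsearch round-trip):</p>

```python
import asyncio
import time

# Toy model (not the article's actual benchmark): each fake request spends
# 50 ms waiting on I/O, like a round-trip to Elasticsearch.
async def fake_request():
    await asyncio.sleep(0.05)

async def sequential(n):
    # One request at a time: the waits add up.
    for _ in range(n):
        await fake_request()

async def concurrent(n):
    # All requests in flight at once: the waits overlap.
    await asyncio.gather(*(fake_request() for _ in range(n)))

for runner in (sequential, concurrent):
    t0 = time.perf_counter()
    asyncio.run(runner(10))
    print(runner.__name__, f"{time.perf_counter() - t0:.2f}s")
```

<p>Sequentially the ten waits sum to roughly 0.5 s; concurrently they overlap and finish in roughly 0.05 s. This is the same mechanism that lets a single FastAPI worker serve other requests while Elasticsearch is responding.</p>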

<h1 id="tests-en-asynchrone">Testing asynchronous code</h1>
<p>Using pytest the classic way does not let us test asynchronous functions.<br />
To do so, we start by installing the following package:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>poetry add pytest-asyncio
</code></pre></div></div>
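<p>Depending on the pytest-asyncio version, you may also need to declare the asyncio mode in the pytest configuration (a hypothetical pyproject.toml fragment; in "auto" mode the @pytest.mark.asyncio marker even becomes optional):</p>

```toml
# Assumption: pytest-asyncio >= 0.21, which warns when no mode is configured.
[tool.pytest.ini_options]
asyncio_mode = "strict"  # or "auto" to pick up async tests without the marker
```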

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># test_data_collect.py
</span><span class="kn">import</span> <span class="nn">pytest</span>
<span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span>
<span class="kn">from</span> <span class="nn">unittest.mock</span> <span class="kn">import</span> <span class="n">patch</span>

<span class="kn">from</span> <span class="nn">tp_fast_api.data_collect</span> <span class="kn">import</span> <span class="n">extract_ecrans_async</span>
<span class="c1"># AsyncIteratorMock is defined below</span>

<span class="o">@</span><span class="n">pytest</span><span class="p">.</span><span class="n">mark</span><span class="p">.</span><span class="n">asyncio</span> <span class="c1"># lets us test an asynchronous function
</span><span class="o">@</span><span class="n">patch</span><span class="p">(</span><span class="s">"tp_fast_api.data_collect.helpers.async_scan"</span><span class="p">)</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">test_extract_ecrans_async</span><span class="p">(</span><span class="n">mock_scan</span><span class="p">):</span> <span class="c1"># the body contains an await, so the function must be declared async
</span>    <span class="n">mock_scan</span><span class="p">.</span><span class="n">return_value</span> <span class="o">=</span> <span class="n">AsyncIteratorMock</span><span class="p">(</span>
        <span class="p">[</span>
            <span class="p">{</span>
                <span class="s">"_index"</span><span class="p">:</span> <span class="s">"mon_index"</span><span class="p">,</span>
                <span class="s">"_type"</span><span class="p">:</span> <span class="s">"_doc"</span><span class="p">,</span>
                <span class="s">"_id"</span><span class="p">:</span> <span class="s">"1660119206971bdcad61d5b"</span><span class="p">,</span>
                <span class="s">"_score"</span><span class="p">:</span> <span class="mf">2.0</span><span class="p">,</span>
                <span class="s">"_source"</span><span class="p">:</span> <span class="p">{</span>
                    <span class="s">"ecran"</span><span class="p">:</span> <span class="s">"connexion"</span><span class="p">,</span>
                    <span class="s">"instance"</span><span class="p">:</span> <span class="s">"instance589"</span><span class="p">,</span>
                    <span class="s">"@timestamp"</span><span class="p">:</span> <span class="mi">1660119206971</span><span class="p">,</span>
                    <span class="s">"util"</span><span class="p">:</span> <span class="s">"romain"</span>
                <span class="p">},</span>
            <span class="p">}</span>
        <span class="p">]</span>
    <span class="p">)</span>

    <span class="n">results</span> <span class="o">=</span> <span class="k">await</span> <span class="n">extract_ecrans_async</span><span class="p">([</span><span class="s">"romain"</span><span class="p">],</span> <span class="n">datetime</span><span class="p">.</span><span class="n">now</span><span class="p">(),</span> <span class="n">datetime</span><span class="p">.</span><span class="n">now</span><span class="p">())</span>

    <span class="k">assert</span> <span class="nb">len</span><span class="p">(</span><span class="n">results</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span>
    <span class="k">assert</span> <span class="n">results</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span> <span class="o">==</span> <span class="s">"romain"</span>
</code></pre></div></div>

<p><img src="/assets/images/test2.png" alt="" style="display:block; margin-left:auto; margin-right:auto" />  <br />
Note that we use an AsyncIteratorMock object.<br />
Indeed, extract_ecrans_async contains an “async for …” loop, which requires iterating over an object that implements the “__aiter__()” method, and MagicMock (which @patch gives us) does not.<br />
We therefore create a class that encapsulates this behaviour and pass it the data we want to iterate over (found <a href="https://stackoverflow.com/questions/36695256/python-asyncio-how-to-mock-aiter-method">here</a>):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">AsyncIteratorMock</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="nb">iter</span> <span class="o">=</span> <span class="nb">iter</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">__aiter__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span>

    <span class="k">async</span> <span class="k">def</span> <span class="nf">__anext__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="k">return</span> <span class="nb">next</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="nb">iter</span><span class="p">)</span>
        <span class="k">except</span> <span class="nb">StopIteration</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">StopAsyncIteration</span>
</code></pre></div></div>
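<p>As a quick sanity check outside pytest, the class can be exercised directly with asyncio:</p>

```python
import asyncio

# Same class as above, repeated here so the snippet is self-contained.
class AsyncIteratorMock:
    def __init__(self, data):
        self.iter = iter(data)

    def __aiter__(self):
        return self

    async def __anext__(self):
        try:
            return next(self.iter)
        except StopIteration:
            raise StopAsyncIteration

async def collect(aiterable):
    # Consume the mock the same way the async_scan results are consumed.
    return [item async for item in aiterable]

print(asyncio.run(collect(AsyncIteratorMock([1, 2, 3]))))  # [1, 2, 3]
```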

<h1 id="traitement-ia">AI processing</h1>
<p>We will now add a processing step on this data to cluster it and obtain <strong>interaction sessions</strong>, delimited by a start and an end, rather than raw data that is harder to interpret. We will use the <a href="http://vulgairedev.fr/blog/article/clustering-hdbscan">hdbscan</a> algorithm.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>poetry add joblib==1.1.0 hdbscan
</code></pre></div></div>
<p>NB: at the time of writing, there is a conflict between joblib and hdbscan on Windows, so you have to <a href="https://stackoverflow.com/questions/73830225/init-got-an-unexpected-keyword-argument-cachedir-when-importing-top2vec">pin an earlier version</a></p>

<p>This time, let's start by writing the test first:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># test_compute_sessions.py
</span><span class="kn">from</span> <span class="nn">tp_fast_api.compute_sessions</span> <span class="kn">import</span> <span class="n">compute_sessions</span>

 
<span class="k">def</span> <span class="nf">test_compute_sessions</span><span class="p">():</span>
    <span class="n">data</span> <span class="o">=</span> <span class="p">[</span>
       <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="s">"romain"</span><span class="p">,</span> <span class="s">"instance"</span><span class="p">],</span> 
       <span class="p">[</span><span class="mi">2</span><span class="p">,</span> <span class="s">"romain"</span><span class="p">,</span> <span class="s">"instance"</span><span class="p">],</span> 
       <span class="p">[</span><span class="mi">6</span><span class="p">,</span> <span class="s">"romain"</span><span class="p">,</span> <span class="s">"instance"</span><span class="p">],</span> 
       <span class="p">[</span><span class="mi">10</span><span class="p">,</span> <span class="s">"romain"</span><span class="p">,</span> <span class="s">"instance"</span><span class="p">],</span> 
       <span class="p">[</span><span class="mi">12</span><span class="p">,</span> <span class="s">"romain"</span><span class="p">,</span> <span class="s">"instance"</span><span class="p">],</span> 
       <span class="p">[</span><span class="mi">13</span><span class="p">,</span> <span class="s">"romain"</span><span class="p">,</span> <span class="s">"instance"</span><span class="p">],</span> 
       <span class="p">[</span><span class="mi">18</span><span class="p">,</span> <span class="s">"anes"</span><span class="p">,</span> <span class="s">"instance"</span><span class="p">],</span>
       <span class="p">[</span><span class="mi">19</span><span class="p">,</span> <span class="s">"anes"</span><span class="p">,</span> <span class="s">"instance"</span><span class="p">]</span> 
    <span class="p">]</span>

    <span class="n">data_cluster</span> <span class="o">=</span> <span class="n">compute_sessions</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
    <span class="k">assert</span> <span class="nb">len</span><span class="p">(</span><span class="n">data_cluster</span><span class="p">)</span> <span class="o">==</span> <span class="mi">2</span>
</code></pre></div></div>

<p>We will now write the function that, given a list of rows such as those returned by extract_ecrans_async(), clusters the data and returns a list of dictionaries, each with a start, an end, an instance, and a person.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># compute_sessions.py
</span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">hdbscan</span>

<span class="kn">import</span> <span class="nn">tp_fast_api.settings</span> <span class="k">as</span> <span class="n">settings</span>


<span class="k">def</span> <span class="nf">compute_sessions</span><span class="p">(</span><span class="n">data</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">list</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">]:</span>
    <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span>
        <span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">,</span>
        <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">"@timestamp"</span><span class="p">,</span> <span class="s">"personne"</span><span class="p">,</span> <span class="s">"instance"</span><span class="p">],</span>
    <span class="p">)</span>

    <span class="n">clusterer</span> <span class="o">=</span> <span class="n">hdbscan</span><span class="p">.</span><span class="n">HDBSCAN</span><span class="p">(</span><span class="n">min_cluster_size</span><span class="o">=</span><span class="n">settings</span><span class="p">.</span><span class="n">CLUSTER_MIN_POINTS</span><span class="p">)</span>

    <span class="n">windows</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">group</span> <span class="ow">in</span> <span class="n">df</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">"personne"</span><span class="p">):</span>
        <span class="c1"># hdbscan seems to raise an error when given a single point
</span>        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">group</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">1</span><span class="p">:</span>
            <span class="n">group</span><span class="p">[</span><span class="s">"cluster"</span><span class="p">]</span> <span class="o">=</span> <span class="n">clusterer</span><span class="p">.</span><span class="n">fit_predict</span><span class="p">(</span>
                <span class="n">group</span><span class="p">[</span><span class="s">"@timestamp"</span><span class="p">].</span><span class="n">array</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
            <span class="p">)</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">group</span><span class="p">[</span><span class="s">"cluster"</span><span class="p">]</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span>

        <span class="n">group</span> <span class="o">=</span> <span class="n">group</span><span class="p">[</span><span class="n">group</span><span class="p">[</span><span class="s">"cluster"</span><span class="p">]</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">]</span>  <span class="c1"># drop the noise points
</span>        <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">cluster_group</span> <span class="ow">in</span> <span class="n">group</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">"cluster"</span><span class="p">):</span>
            <span class="n">new_data</span> <span class="o">=</span> <span class="p">{</span>
                <span class="s">"debut"</span><span class="p">:</span> <span class="n">cluster_group</span><span class="p">[</span><span class="s">"@timestamp"</span><span class="p">].</span><span class="nb">min</span><span class="p">(),</span>
                <span class="s">"fin"</span><span class="p">:</span> <span class="n">cluster_group</span><span class="p">[</span><span class="s">"@timestamp"</span><span class="p">].</span><span class="nb">max</span><span class="p">(),</span>
                <span class="s">"instance"</span><span class="p">:</span> <span class="n">cluster_group</span><span class="p">.</span><span class="n">at</span><span class="p">[</span><span class="n">cluster_group</span><span class="p">.</span><span class="n">index</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="s">"instance"</span><span class="p">],</span>
                <span class="s">"personne"</span><span class="p">:</span> <span class="n">cluster_group</span><span class="p">.</span><span class="n">at</span><span class="p">[</span><span class="n">cluster_group</span><span class="p">.</span><span class="n">index</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="s">"personne"</span><span class="p">],</span>
            <span class="p">}</span>
            <span class="n">windows</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">new_data</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">windows</span>
</code></pre></div></div>
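<p>HDBSCAN handles clusters of varying density, but for one-dimensional timestamps the intuition can be illustrated with a simpler, hypothetical gap-based baseline (not what the service uses): start a new session whenever two consecutive events are further apart than a threshold:</p>

```python
# Hypothetical baseline: split sorted timestamps into (start, end) sessions
# whenever the gap between consecutive events exceeds max_gap.
def sessions_by_gap(timestamps, max_gap):
    timestamps = sorted(timestamps)
    sessions = []
    start = prev = timestamps[0]
    for t in timestamps[1:]:
        if t - prev > max_gap:  # gap too large: close the current session
            sessions.append((start, prev))
            start = t
        prev = t
    sessions.append((start, prev))
    return sessions

print(sessions_by_gap([1, 2, 6, 10, 12, 13, 40, 41], max_gap=5))  # [(1, 13), (40, 41)]
```

<p>Unlike HDBSCAN, this needs a fixed threshold and has no notion of noise points, which is precisely what the density-based approach buys us.</p>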

<p>All that's left is to update main_async:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># main_async.py
</span><span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span>

<span class="kn">from</span> <span class="nn">fastapi</span> <span class="kn">import</span> <span class="n">FastAPI</span>
<span class="kn">from</span> <span class="nn">pydantic</span> <span class="kn">import</span> <span class="n">BaseModel</span>

<span class="kn">from</span> <span class="nn">tp_fast_api.data_collect</span> <span class="kn">import</span> <span class="n">extract_ecrans_async</span>
<span class="kn">from</span> <span class="nn">tp_fast_api.compute_sessions</span> <span class="kn">import</span> <span class="n">compute_sessions</span>

<span class="k">class</span> <span class="nc">Users</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
    <span class="n">utils</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span>
    <span class="n">date_min</span><span class="p">:</span> <span class="n">datetime</span>
    <span class="n">date_max</span><span class="p">:</span> <span class="n">datetime</span>


<span class="n">app</span> <span class="o">=</span> <span class="n">FastAPI</span><span class="p">(</span><span class="n">debug</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>


<span class="o">@</span><span class="n">app</span><span class="p">.</span><span class="n">post</span><span class="p">(</span><span class="s">"/get_ecrans/"</span><span class="p">)</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">get_ecrans</span><span class="p">(</span><span class="n">users</span><span class="p">:</span> <span class="n">Users</span><span class="p">):</span>
    <span class="n">ecrans</span> <span class="o">=</span> <span class="k">await</span> <span class="n">extract_ecrans_async</span><span class="p">(</span><span class="n">users</span><span class="p">.</span><span class="n">utils</span><span class="p">,</span> <span class="n">users</span><span class="p">.</span><span class="n">date_min</span><span class="p">,</span> <span class="n">users</span><span class="p">.</span><span class="n">date_max</span><span class="p">)</span>
    <span class="n">sessions</span> <span class="o">=</span> <span class="n">compute_sessions</span><span class="p">(</span><span class="n">ecrans</span><span class="p">)</span>
    <span class="k">return</span> <span class="p">{</span><span class="s">"sessions"</span><span class="p">:</span> <span class="n">sessions</span><span class="p">}</span>
</code></pre></div></div>

<p>And we do get a <strong>service that, for the specified parameters, returns refined data that is easier for the business to exploit</strong>. We could then build a data visualization on top of it, compute other metrics from this information, and so on.<br />
For production deployment there are several options; see the <a href="https://fastapi.tiangolo.com/deployment/">fastAPI</a> documentation</p>

<h1 id="gestion-du-format-de-sortie">Managing the output format</h1>
<p>To finish, we will use pydantic to specify the output format, which lets us avoid type-conversion errors when building the response, and also documents our API better:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># main_async
</span><span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span>

<span class="kn">from</span> <span class="nn">fastapi</span> <span class="kn">import</span> <span class="n">FastAPI</span>
<span class="kn">from</span> <span class="nn">pydantic</span> <span class="kn">import</span> <span class="n">BaseModel</span>

<span class="kn">from</span> <span class="nn">tp_fast_api.data_collect</span> <span class="kn">import</span> <span class="n">extract_ecrans</span><span class="p">,</span> <span class="n">extract_ecrans_async</span>
<span class="kn">from</span> <span class="nn">tp_fast_api.compute_sessions</span> <span class="kn">import</span> <span class="n">compute_sessions</span>

<span class="n">app</span> <span class="o">=</span> <span class="n">FastAPI</span><span class="p">(</span><span class="n">debug</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>


<span class="k">class</span> <span class="nc">Users</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
    <span class="n">utils</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span>
    <span class="n">date_min</span><span class="p">:</span> <span class="n">datetime</span>
    <span class="n">date_max</span><span class="p">:</span> <span class="n">datetime</span>


<span class="k">class</span> <span class="nc">Session</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
    <span class="n">debut</span><span class="p">:</span> <span class="n">datetime</span>
    <span class="n">fin</span><span class="p">:</span> <span class="n">datetime</span>
    <span class="n">instance</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">personne</span><span class="p">:</span> <span class="nb">str</span>


<span class="k">class</span> <span class="nc">OutResponse</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
    <span class="n">sessions</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">Session</span><span class="p">]</span>


<span class="o">@</span><span class="n">app</span><span class="p">.</span><span class="n">post</span><span class="p">(</span><span class="s">"/get_ecrans/"</span><span class="p">,</span> <span class="n">response_model</span><span class="o">=</span><span class="n">OutResponse</span><span class="p">)</span> 
<span class="k">async</span> <span class="k">def</span> <span class="nf">get_ecrans</span><span class="p">(</span><span class="n">users</span><span class="p">:</span> <span class="n">Users</span><span class="p">):</span>
    <span class="n">ecrans</span> <span class="o">=</span> <span class="k">await</span> <span class="n">extract_ecrans_async</span><span class="p">(</span><span class="n">users</span><span class="p">.</span><span class="n">utils</span><span class="p">,</span> <span class="n">users</span><span class="p">.</span><span class="n">date_min</span><span class="p">,</span> <span class="n">users</span><span class="p">.</span><span class="n">date_max</span><span class="p">)</span>
    <span class="n">sessions</span> <span class="o">=</span> <span class="n">compute_sessions</span><span class="p">(</span><span class="n">ecrans</span><span class="p">)</span>
    <span class="k">return</span> <span class="p">{</span><span class="s">"sessions"</span><span class="p">:</span> <span class="n">sessions</span><span class="p">}</span>
</code></pre></div></div>
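<p>Under the hood, response_model does two things for us: it validates each session against the declared fields, and it serializes datetime values to ISO-8601 strings in the JSON body. A stdlib-only sketch of that behaviour, with dataclasses standing in for the pydantic models:</p>

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime

# Stand-ins for the pydantic models above, just to illustrate serialization.
@dataclass
class Session:
    debut: datetime
    fin: datetime
    instance: str
    personne: str

def to_json_body(sessions: list[Session]) -> str:
    # datetime is not JSON-serializable by default; convert to ISO-8601,
    # which is what pydantic/FastAPI do for us automatically.
    return json.dumps(
        {"sessions": [asdict(s) for s in sessions]},
        default=lambda o: o.isoformat(),
    )

s = Session(datetime(2022, 8, 10, 9, 0), datetime(2022, 8, 10, 9, 30),
            "instance589", "romain")
print(to_json_body([s]))
```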

<h1 id="typage-des-données-delastic">Typing the data coming out of Elasticsearch</h1>

<p>On reflection, we realize that typing the data coming out of Elasticsearch matters too: without it, text fields such as “01” can inadvertently be converted to float or int.
To solve this problem we could use <a href="https://elasticsearch-dsl.readthedocs.io/en/latest/">elasticsearch-dsl</a>, but it does not work asynchronously.
We can therefore use pydantic directly, for example:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># data_collect.py
</span><span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span>

<span class="kn">from</span> <span class="nn">pydantic</span> <span class="kn">import</span> <span class="n">BaseModel</span>
<span class="kn">from</span> <span class="nn">elasticsearch</span> <span class="kn">import</span> <span class="n">helpers</span>

<span class="kn">import</span> <span class="nn">tp_fast_api.settings</span> <span class="k">as</span> <span class="n">settings</span>

<span class="k">class</span> <span class="nc">Ecran</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
    <span class="n">ecran</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">instance</span><span class="p">:</span> <span class="nb">str</span> 
    <span class="n">timestamp</span><span class="p">:</span> <span class="n">datetime</span>
    <span class="n">util</span><span class="p">:</span> <span class="nb">str</span>

    
<span class="k">async</span> <span class="k">def</span> <span class="nf">extract_ecrans_async</span><span class="p">(</span><span class="n">utils</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span> <span class="n">date_min</span><span class="p">:</span> <span class="n">datetime</span><span class="p">,</span> <span class="n">date_max</span><span class="p">:</span> <span class="n">datetime</span><span class="p">):</span>
    <span class="n">query</span> <span class="o">=</span> <span class="p">{</span>
        <span class="s">"query"</span><span class="p">:</span> <span class="p">{</span>
            <span class="s">"bool"</span><span class="p">:</span> <span class="p">{</span>
                <span class="s">"must"</span><span class="p">:</span> <span class="p">[{</span><span class="s">"terms"</span><span class="p">:</span> <span class="p">{</span><span class="s">"util"</span><span class="p">:</span> <span class="n">utils</span><span class="p">}}],</span>
                <span class="s">"filter"</span><span class="p">:</span> <span class="p">[</span>
                    <span class="p">{</span>
                        <span class="s">"range"</span><span class="p">:</span> <span class="p">{</span>
                            <span class="s">"@timestamp"</span><span class="p">:</span> <span class="p">{</span>
                                <span class="s">"gte"</span><span class="p">:</span> <span class="n">date_min</span><span class="p">.</span><span class="n">isoformat</span><span class="p">(),</span>
                                <span class="s">"lte"</span><span class="p">:</span> <span class="n">date_max</span><span class="p">.</span><span class="n">isoformat</span><span class="p">(),</span>
                                <span class="s">"format"</span><span class="p">:</span> <span class="s">"strict_date_optional_time"</span><span class="p">,</span>
                            <span class="p">}</span>
                        <span class="p">}</span>
                    <span class="p">}</span>
                <span class="p">],</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>

    <span class="n">data</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">async</span> <span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">helpers</span><span class="p">.</span><span class="n">async_scan</span><span class="p">(</span><span class="n">settings</span><span class="p">.</span><span class="n">ES_ASYNC</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="s">"mon_index"</span><span class="p">,</span> <span class="n">query</span><span class="o">=</span><span class="n">query</span><span class="p">):</span>
        <span class="n">doc</span> <span class="o">=</span> <span class="n">doc</span><span class="p">[</span><span class="s">"_source"</span><span class="p">]</span>
        <span class="n">doc_parsed</span> <span class="o">=</span> <span class="n">Ecran</span><span class="p">(</span><span class="o">**</span><span class="n">doc</span><span class="p">,</span> <span class="n">timestamp</span><span class="o">=</span><span class="n">doc</span><span class="p">[</span><span class="s">"@timestamp"</span><span class="p">])</span>
        <span class="n">elt</span> <span class="o">=</span> <span class="p">[</span>
            <span class="n">doc_parsed</span><span class="p">.</span><span class="n">ecran</span><span class="p">,</span>
            <span class="n">doc_parsed</span><span class="p">.</span><span class="n">timestamp</span><span class="p">,</span>
            <span class="n">doc_parsed</span><span class="p">.</span><span class="n">util</span><span class="p">,</span>
            <span class="n">doc_parsed</span><span class="p">.</span><span class="n">instance</span><span class="p">,</span>
        <span class="p">]</span>
        <span class="n">data</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">elt</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">data</span>
</code></pre></div></div>

<p>NB: to keep things cleaner, it is better to put the pydantic data models in a separate modeles.py module</p>]]></content><author><name>Romain Mathonat</name><email>romain.mathonat@gmail.com</email></author><summary type="html"><![CDATA[Intro: 9 out of 10 “data science” projects never make it to production. One of the reasons is the difficulty, as well as the lack of standards, involved in going from a notebook to a genuinely useful working product. In this tutorial we will see, through a simple case, how to use fastAPI to build an API exposing AI services, which can then be queried over HTTP from any other software component. More precisely, we will retrieve usage data for a piece of software deployed on a fleet of machines, stored in elasticsearch, and refine it to extract sessions (clustering along the time axis only).]]></summary></entry><entry><title type="html">Clustering: an introduction to HDBSCAN</title><link href="http://vulgairedev.fr/2021/07/20/HDBSCAN.html" rel="alternate" type="text/html" title="Clustering: an introduction to HDBSCAN" /><published>2021-07-20T00:00:00+00:00</published><updated>2021-07-20T00:00:00+00:00</updated><id>http://vulgairedev.fr/2021/07/20/HDBSCAN</id><content type="html" xml:base="http://vulgairedev.fr/2021/07/20/HDBSCAN.html"><![CDATA[<p><strong>Clustering</strong> is the task of automatically grouping similar objects together. The goal is to minimize the intra-group distance and to maximize the distance between groups (definitions vary slightly across papers, however).</p>

<p>Clustering algorithms are very useful for <em>exploratory data analysis</em>, i.e. for studying a dataset and letting it speak without any a priori knowledge about it. Use cases are varied: automatically labeling a dataset (a costly step when done by an expert), <a href="https://www.kdnuggets.com/2020/11/topic-modeling-bert.html">automatically discovering discussion topics</a>, better understanding the underlying domain, etc.</p>

<p><strong>HDBSCAN</strong> (Hierarchical DBSCAN) is a clustering algorithm proposed by Campello et al. in 2013 [2]. It starts from the following observations:</p>
<ul>
  <li>density-based clustering algorithms, such as <a href="https://fr.wikipedia.org/wiki/DBSCAN">DBSCAN</a>, only cluster according to a single global density threshold, which prevents them from finding clusters whose densities vary too much.</li>
  <li><a href="https://fr.wikipedia.org/wiki/Regroupement_hi%C3%A9rarchique">hierarchical clustering</a> algorithms are interesting too, but they can produce a hierarchy that is too complex and hard to interpret.</li>
  <li>another common problem is the proliferation of parameters that strongly influence the result (for example, the number of clusters must be specified for <a href="https://fr.wikipedia.org/wiki/K-moyennes">k-means</a>).</li>
</ul>

<p>Note that a clustering algorithm differs from a partitioning algorithm such as k-means. The goal of the latter is to assign every element to one of k groups while minimizing the intra-group distance. In our definition of clustering, some points are allowed to belong to no group at all: they are considered <em>noise</em>.</p>

<p>In short, HDBSCAN is a blend of a hierarchical clustering algorithm and DBSCAN. It can handle clusters of different densities, requires little tuning, and gives very good results. Moreover, an efficient implementation integrated with sk-learn <a href="https://github.com/scikit-learn-contrib/hdbscan">has been proposed</a> following more recent work [1].</p>

<p><strong>NB</strong>: The illustrations that follow are taken from [4]. This notebook is not a mere translation but a somewhat different way of presenting the algorithm.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="kn">import</span> <span class="nn">sklearn.datasets</span> <span class="k">as</span> <span class="n">data</span>
<span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
<span class="n">plt</span><span class="p">.</span><span class="n">rcParams</span><span class="p">[</span><span class="s">'figure.figsize'</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="mi">14</span><span class="p">,</span> <span class="mi">8</span><span class="p">]</span>
<span class="kn">from</span> <span class="nn">IPython.core.display</span> <span class="kn">import</span> <span class="n">HTML</span>
<span class="n">HTML</span><span class="p">(</span><span class="s">"""
&lt;style&gt;
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
&lt;/style&gt;
"""</span><span class="p">)</span>

<span class="n">sns</span><span class="p">.</span><span class="n">set_context</span><span class="p">(</span><span class="s">'poster'</span><span class="p">)</span>
<span class="n">sns</span><span class="p">.</span><span class="n">set_style</span><span class="p">(</span><span class="s">'white'</span><span class="p">)</span>
<span class="n">sns</span><span class="p">.</span><span class="n">set_color_codes</span><span class="p">()</span>
<span class="n">plot_kwds</span> <span class="o">=</span> <span class="p">{</span><span class="s">'alpha'</span> <span class="p">:</span> <span class="mf">0.5</span><span class="p">,</span> <span class="s">'s'</span> <span class="p">:</span> <span class="mi">80</span><span class="p">,</span> <span class="s">'linewidths'</span><span class="p">:</span><span class="mi">0</span><span class="p">}</span>
</code></pre></div></div>

<h3 id="définitions">Definitions</h3>
<p><strong>Core distance of a point dcore(A)</strong>: distance to its mpts-th nearest neighbor.
The smaller it is, the denser the neighborhood of the point.</p>

<p><img src="/assets/images/1.png" alt="" style="display:block; margin-left:auto; margin-right:auto" /></p>
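<p>As a minimal sketch, the core distance can be computed with a plain nearest-neighbor query using scikit-learn (which this notebook already relies on); note that whether the point itself counts among the mpts neighbors is a convention detail, and the tiny dataset below is made up for illustration:</p>

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def core_distances(X, mpts):
    # core distance of each point: distance to its mpts-th nearest neighbor
    # (the point itself is counted among the neighbors here, one common convention)
    nn = NearestNeighbors(n_neighbors=mpts).fit(X)
    dist, _ = nn.kneighbors(X)
    return dist[:, -1]

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0]])
print(core_distances(X, mpts=2))  # the isolated point gets a large core distance
```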

<p><strong>Mutual Reachability Distance dMRD(A, B)</strong>: between two objects, the MRD is the maximum of the distance between the two objects, the core distance of the first object, and the core distance of the second object.</p>

<p>For example, the MRD between the green point and the blue point is the core distance of the green point, unlike the MRD between the green point and the red point, which is simply the distance between those two points.</p>

<p><img src="/assets/images/2.png" alt="" style="display:block; margin-left:auto; margin-right:auto" /></p>

<p><strong>Mutual Reachability Graph</strong>: the complete graph in which every pair of points is connected by an edge weighted by its MRD.</p>
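<p>The edge weights of this graph can be assembled into a matrix directly with numpy and scipy. A small sketch (the helper name <code>mutual_reachability</code> and the convention that a point counts among its own neighbors are mine):</p>

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def mutual_reachability(X, mpts):
    D = squareform(pdist(X))                 # pairwise Euclidean distances
    core = np.sort(D, axis=1)[:, mpts - 1]   # core distances (self counted among neighbors)
    # d_MRD(A, B) = max(d_core(A), d_core(B), d(A, B))
    return np.maximum(np.maximum(core[:, None], core[None, :]), D)

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0]])
print(mutual_reachability(X, mpts=2))
```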

<h3 id="equivalence-entre-le-mutual-reachability-graph-et-dbscan">Equivalence between the mutual reachability graph and DBSCAN</h3>
<p>(The notion of ε-neighborhood is introduced for DBSCAN only)<br />
<strong>ε-neighborhood</strong>: for a point A, the set of other points at a distance &lt; ε from A.</p>

<p>The principle of DBSCAN, as revisited in the HDBSCAN paper, is to build maximal groups of points (= clusters) such that:</p>
<ul>
  <li>each point has at least mpts points in its ε-neighborhood.</li>
  <li>every pair of points (A, B) in a group is connected, meaning that either A is in the ε-neighborhood of B and vice versa, or there is a chain of points between A and B for which this property holds.</li>
</ul>

<p><strong>NB</strong>: This is not <em>exactly</em> DBSCAN: border points are removed here.</p>

<p>Now imagine that we take the mutual reachability graph, remove the edges whose weight is &gt; ε, and remove the isolated points (outliers).
Then, for any two points A, B connected in this new graph:<br />
\(d_{MRD}(A, B) \leq ε\) \(d_{core}(A) \leq ε\) \(d_{core}(B) \leq ε\) \(d(A, B) \leq ε\)</p>

<p>We then have groups of points, each of which is known to have at least mpts points in its ε-neighborhood. Moreover, in every pair of directly connected points, each point is in the ε-neighborhood of the other. We therefore have a clustering equivalent to what DBSCAN would produce.</p>

<p><strong>Proposition</strong>: If we run a hierarchical clustering algorithm (Single Linkage) on the mutual reachability graph, we obtain a dendrogram. If we cut it at level ε, we obtain a DBSCAN clustering.</p>

<p>This method is not efficient as such, however, so the authors proposed to implement the idea in a slightly different way.</p>
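<p>To make the proposition concrete, here is a small sketch (not the optimized implementation the authors propose): run scipy's single-linkage clustering on the mutual reachability distances and cut the dendrogram at ε. The toy data and the parameter values below are mine:</p>

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11]], dtype=float)
mpts, eps = 2, 2.0

D = squareform(pdist(X))
core = np.sort(D, axis=1)[:, mpts - 1]   # core distances (self counted among neighbors)
mrd = np.maximum(np.maximum(core[:, None], core[None, :]), D)

# single linkage on the mutual reachability graph, then cut at eps:
# the flat clusters match a DBSCAN-style clustering at that threshold
Z = linkage(squareform(mrd, checks=False), method="single")
labels = fcluster(Z, t=eps, criterion="distance")
print(labels)  # two groups: the three points near the origin, the two near (10, 10)
```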

<h3 id="utilisation-du-minimum-spanning-tree-mst">Using the Minimum Spanning Tree (MST)</h3>
<p>First, let us generate some points for what follows.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">moons</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">data</span><span class="p">.</span><span class="n">make_moons</span><span class="p">(</span><span class="n">n_samples</span><span class="o">=</span><span class="mi">150</span><span class="p">,</span> <span class="n">noise</span><span class="o">=</span><span class="mf">0.08</span><span class="p">)</span>
<span class="n">blobs</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">data</span><span class="p">.</span><span class="n">make_blobs</span><span class="p">(</span><span class="n">n_samples</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">centers</span><span class="o">=</span><span class="p">[(</span><span class="o">-</span><span class="mf">0.75</span><span class="p">,</span><span class="mf">2.25</span><span class="p">),</span> <span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">)],</span> <span class="n">cluster_std</span><span class="o">=</span><span class="mf">0.40</span><span class="p">)</span>
<span class="n">test_data</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">vstack</span><span class="p">([</span><span class="n">moons</span><span class="p">,</span> <span class="n">blobs</span><span class="p">])</span>
<span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">test_data</span><span class="p">.</span><span class="n">T</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">test_data</span><span class="p">.</span><span class="n">T</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">color</span><span class="o">=</span><span class="s">'b'</span><span class="p">,</span> <span class="o">**</span><span class="n">plot_kwds</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;matplotlib.collections.PathCollection at 0x21697a29400&gt;
</code></pre></div></div>

<p><img src="/assets/images/HDBSCAN_7_1.png" alt="" style="display:block; margin-left:auto; margin-right:auto" /></p>

<p>We run hdbscan now, to better illustrate what follows.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">hdbscan</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">clusterer</span> <span class="o">=</span> <span class="n">hdbscan</span><span class="p">.</span><span class="n">HDBSCAN</span><span class="p">(</span><span class="n">min_cluster_size</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">gen_min_span_tree</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">clusterer</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">test_data</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>HDBSCAN(gen_min_span_tree=True)
</code></pre></div></div>

<p>The first step is to build an <a href="https://fr.wikipedia.org/wiki/Arbre_couvrant_de_poids_minimal">MST</a>, with the particularity that every node carries an edge to itself, weighted by its core distance.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">clusterer</span><span class="p">.</span><span class="n">minimum_spanning_tree_</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">edge_cmap</span><span class="o">=</span><span class="s">'viridis'</span><span class="p">,</span> 
                                      <span class="n">edge_alpha</span><span class="o">=</span><span class="mf">0.6</span><span class="p">,</span> 
                                      <span class="n">node_size</span><span class="o">=</span><span class="mi">60</span><span class="p">,</span> 
                                      <span class="n">edge_linewidth</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;matplotlib.axes._subplots.AxesSubplot at 0x2169729bf10&gt;
</code></pre></div></div>

<p><img src="/assets/images/HDBSCAN_12_1.png" alt="" style="display:block; margin-left:auto; margin-right:auto" /></p>

<p>In fact, the dendrogram produced by (single-linkage) hierarchical clustering on the mutual reachability graph can be obtained by building this MST and then removing its edges successively in order of decreasing weight. The MST itself can be built with <a href="https://fr.wikipedia.org/wiki/Algorithme_de_Prim">Prim's algorithm</a>, for example.</p>
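<p>As an illustration (a sketch, not the library's internal code), scipy can build this MST from the mutual reachability matrix; self-loop edges are zeroed out here, since a spanning tree never contains them anyway:</p>

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

X = np.random.RandomState(0).rand(20, 2)
mpts = 3

D = squareform(pdist(X))
core = np.sort(D, axis=1)[:, mpts - 1]   # core distances (self counted among neighbors)
mrd = np.maximum(np.maximum(core[:, None], core[None, :]), D)
np.fill_diagonal(mrd, 0.0)               # drop self-loops

# MST of the mutual reachability graph; removing its edges by
# decreasing weight reproduces the single-linkage dendrogram
mst = minimum_spanning_tree(mrd).toarray()
print(int((mst > 0).sum()))  # an MST over n points has n - 1 edges
```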

<p>We thus obtain the following dendrogram:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">clusterer</span><span class="p">.</span><span class="n">single_linkage_tree_</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">cmap</span><span class="o">=</span><span class="s">'viridis'</span><span class="p">,</span> <span class="n">colorbar</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;matplotlib.axes._subplots.AxesSubplot at 0x21695d02bb0&gt;
</code></pre></div></div>

<p><img src="/assets/images/HDBSCAN_14_1.png" alt="" style="display:block; margin-left:auto; margin-right:auto" /></p>

<p>The main pseudocode of the algorithm is as follows:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. Compute the core distance with parameter m_pts for every point X
2. Compute the MST of the Mutual Reachability Graph
3. Add to each node an edge pointing to itself, weighted by its core_distance
4. Extract the dendrogram from the MST:
    4.1 For the root, put all objects in the same cluster
    4.2 For each edge of the MST, in order of decreasing weight (remove equal-weight edges together):
        4.2.1 Set the current level of the dendrogram to the weight of the edge being removed
        4.2.2 Assign new labels to the newly created clusters. If one of them has no edges left, remove it (noise).
        
</code></pre></div></div>

<p>The question now is how to extract clusters from this diagram. The DBSCAN approach would be to draw a horizontal line and take all the clusters at that level. But we want to allow density variations. How?</p>

<h3 id="simplification-hierarchique">Hierarchical simplification</h3>

<p>We introduce a new parameter mclSize, which corresponds to the minimum number of elements in a cluster (the authors recommend setting mclSize = mpts).</p>

<p>The idea is to “smooth” the clusters, by considering that clusters created during a split are noise if they contain fewer than mclSize points, and therefore do not constitute a real “split”.</p>

<p>We therefore redefine step 4.2 of the main algorithm:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>4.2 For each edge of the MST, in order of decreasing weight (remove equal-weight edges together):
      If the size of a resulting cluster is &lt; m_clSize -&gt; Noise
      If only one cluster is created -&gt; keep the parent cluster's label
      If &gt; 1 cluster, each of size &gt; m_clSize -&gt; assign a new label to each.
</code></pre></div></div>

<p>The HDBSCAN implementation we use lets us visualize the clusters as the algorithm progresses.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">clusterer</span><span class="p">.</span><span class="n">condensed_tree_</span><span class="p">.</span><span class="n">plot</span><span class="p">()</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;matplotlib.axes._subplots.AxesSubplot at 0x21697b15fa0&gt;
</code></pre></div></div>

<p><img src="/assets/images/HDBSCAN_18_1.png" alt="" style="display:block; margin-left:auto; margin-right:auto" /></p>

<p>We can see that some points “fall out” of the clusters as the algorithm progresses. The $\lambda$ value corresponds to $\frac{1}{\epsilon}$.</p>

<p>Now that we have a simplified hierarchy, we need to extract clusters from it.</p>

<h3 id="extraction-des-clusters">Extracting the clusters</h3>

<p>Note that any given point must be covered by one cluster only.
The idea is to use a stability measure to choose which clusters are the most relevant.</p>

<p>The stability of a cluster is defined as:</p>

<p>\(S(C) = \sum_{x \in C}^{} (\lambda_{max}(x, C) - \lambda_{min}(C))\)
where λmin(C) is the density level at which the cluster was created, and λmax(x, C) is the density level at which object x disappears from C (either when the cluster is split, or when x becomes noise). A cluster is therefore all the more stable when it contains many objects that “stay” (i.e. are not considered noise as the minimum density increases).</p>

<p>We then proceed bottom-up, considering all the leaf clusters as selected, and moving up with the following rule at each cluster merge:</p>
<ul>
  <li>if the parent cluster has a greater stability, select it and deselect its two children.</li>
  <li>otherwise, keep the children selected and set the parent's stability value to the sum of its children's stabilities.</li>
</ul>

<p>We go up this way until the root, and obtain our set of selected clusters.</p>
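<p>The selection rule above can be sketched recursively on a toy condensed tree. The node names and stability values below are made up for illustration, and the real implementation never selects the root (a detail this toy glosses over):</p>

```python
# hypothetical condensed tree: node -> (stability, children)
tree = {
    "root": (1.0, ["A", "B"]),
    "A":    (3.0, []),
    "B":    (0.5, ["B1", "B2"]),
    "B1":   (0.4, []),
    "B2":   (0.3, []),
}

def select_clusters(node):
    """Return (selected clusters, propagated stability) for the subtree at node."""
    stability, children = tree[node]
    if not children:              # leaves start out selected
        return {node}, stability
    selected, child_sum = set(), 0.0
    for child in children:
        s, v = select_clusters(child)
        selected |= s
        child_sum += v
    if stability > child_sum:     # parent more stable: it replaces its descendants
        return {node}, stability
    return selected, child_sum    # otherwise keep the children, propagate their sum

print(sorted(select_clusters("root")[0]))  # ['A', 'B1', 'B2']
```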

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">clusterer</span><span class="p">.</span><span class="n">condensed_tree_</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">select_clusters</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">selection_palette</span><span class="o">=</span><span class="n">sns</span><span class="p">.</span><span class="n">color_palette</span><span class="p">())</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;matplotlib.axes._subplots.AxesSubplot at 0x216980a5f40&gt;
</code></pre></div></div>

<p><img src="/assets/images/HDBSCAN_20_1.png" alt="" style="display:block; margin-left:auto; margin-right:auto" /></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">palette</span> <span class="o">=</span> <span class="n">sns</span><span class="p">.</span><span class="n">color_palette</span><span class="p">()</span>
<span class="n">cluster_colors</span> <span class="o">=</span> <span class="p">[</span><span class="n">sns</span><span class="p">.</span><span class="n">desaturate</span><span class="p">(</span><span class="n">palette</span><span class="p">[</span><span class="n">col</span><span class="p">],</span> <span class="n">sat</span><span class="p">)</span> 
                  <span class="k">if</span> <span class="n">col</span> <span class="o">&gt;=</span> <span class="mi">0</span> <span class="k">else</span> <span class="p">(</span><span class="mf">0.5</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">)</span> <span class="k">for</span> <span class="n">col</span><span class="p">,</span> <span class="n">sat</span> <span class="ow">in</span> 
                  <span class="nb">zip</span><span class="p">(</span><span class="n">clusterer</span><span class="p">.</span><span class="n">labels_</span><span class="p">,</span> <span class="n">clusterer</span><span class="p">.</span><span class="n">probabilities_</span><span class="p">)]</span>
<span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">test_data</span><span class="p">.</span><span class="n">T</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">test_data</span><span class="p">.</span><span class="n">T</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">c</span><span class="o">=</span><span class="n">cluster_colors</span><span class="p">,</span> <span class="o">**</span><span class="n">plot_kwds</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;matplotlib.collections.PathCollection at 0x2169936e910&gt;
</code></pre></div></div>

<p><img src="/assets/images/HDBSCAN_21_1.png" alt="" style="display:block; margin-left:auto; margin-right:auto" /></p>

<h3 id="quelques-comparaisons">A few comparisons</h3>
<p>Let us try with the famous k-means.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">sklearn.cluster</span> <span class="k">as</span> <span class="n">cluster</span>
<span class="kn">import</span> <span class="nn">time</span> 

<span class="k">def</span> <span class="nf">plot_clusters</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">algorithm</span><span class="p">,</span> <span class="n">args</span><span class="p">,</span> <span class="n">kwds</span><span class="p">):</span>
    <span class="n">start_time</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
    <span class="n">labels</span> <span class="o">=</span> <span class="n">algorithm</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwds</span><span class="p">).</span><span class="n">fit_predict</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
    <span class="n">end_time</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
    <span class="n">palette</span> <span class="o">=</span> <span class="n">sns</span><span class="p">.</span><span class="n">color_palette</span><span class="p">(</span><span class="s">'deep'</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">unique</span><span class="p">(</span><span class="n">labels</span><span class="p">).</span><span class="nb">max</span><span class="p">()</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
    <span class="n">colors</span> <span class="o">=</span> <span class="p">[</span><span class="n">palette</span><span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="k">if</span> <span class="n">x</span> <span class="o">&gt;=</span> <span class="mi">0</span> <span class="k">else</span> <span class="p">(</span><span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">labels</span><span class="p">]</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="n">T</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">data</span><span class="p">.</span><span class="n">T</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">c</span><span class="o">=</span><span class="n">colors</span><span class="p">,</span> <span class="o">**</span><span class="n">plot_kwds</span><span class="p">)</span>
    <span class="n">frame</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">gca</span><span class="p">()</span>
    <span class="n">frame</span><span class="p">.</span><span class="n">axes</span><span class="p">.</span><span class="n">get_xaxis</span><span class="p">().</span><span class="n">set_visible</span><span class="p">(</span><span class="bp">False</span><span class="p">)</span>
    <span class="n">frame</span><span class="p">.</span><span class="n">axes</span><span class="p">.</span><span class="n">get_yaxis</span><span class="p">().</span><span class="n">set_visible</span><span class="p">(</span><span class="bp">False</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Clusters found by {}'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">algorithm</span><span class="p">.</span><span class="n">__name__</span><span class="p">)),</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">24</span><span class="p">)</span>
    
<span class="n">plot_clusters</span><span class="p">(</span><span class="n">test_data</span><span class="p">,</span> <span class="n">cluster</span><span class="p">.</span><span class="n">KMeans</span><span class="p">,</span> <span class="p">(),</span> <span class="p">{</span><span class="s">'n_clusters'</span><span class="p">:</span><span class="mi">4</span><span class="p">})</span>
</code></pre></div></div>

<p><img src="/assets/images/HDBSCAN_23_0.png" alt="" style="display:block; margin-left:auto; margin-right:auto" /></p>

<p>Every point was assigned to a cluster, but we can see a problem with the crescent shapes at the bottom.
If we try to specify 3 classes instead, the result is also unsatisfying:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plot_clusters</span><span class="p">(</span><span class="n">test_data</span><span class="p">,</span> <span class="n">cluster</span><span class="p">.</span><span class="n">KMeans</span><span class="p">,</span> <span class="p">(),</span> <span class="p">{</span><span class="s">'n_clusters'</span><span class="p">:</span><span class="mi">3</span><span class="p">})</span>
</code></pre></div></div>

<p><img src="/assets/images/HDBSCAN_25_0.png" alt="" style="display:block; margin-left:auto; margin-right:auto" /></p>

<p>Now let us try with DBSCAN:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plot_clusters</span><span class="p">(</span><span class="n">test_data</span><span class="p">,</span> <span class="n">cluster</span><span class="p">.</span><span class="n">DBSCAN</span><span class="p">,</span> <span class="p">(),</span> <span class="p">{</span><span class="s">'eps'</span><span class="p">:</span><span class="mf">0.1</span><span class="p">})</span>
</code></pre></div></div>

<p><img src="/assets/images/HDBSCAN_27_0.png" alt="" style="display:block; margin-left:auto; margin-right:auto" /></p>

<p>Here we run into the parameter tuning problem, which is not easy. By trial and error, the best clustering I could find is the following:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plot_clusters</span><span class="p">(</span><span class="n">test_data</span><span class="p">,</span> <span class="n">cluster</span><span class="p">.</span><span class="n">DBSCAN</span><span class="p">,</span> <span class="p">(),</span> <span class="p">{</span><span class="s">'eps'</span><span class="p">:</span><span class="mf">0.35</span><span class="p">})</span>
</code></pre></div></div>

<p><img src="/assets/images/HDBSCAN_29_0.png" alt="" style="display:block; margin-left:auto; margin-right:auto" /></p>

<p>For more experiments and comparisons, see [1] and [2]. For performance benchmarks, see [5].</p>

<h3 id="références">References:</h3>
<p>[1] L. McInnes and J. Healy, “Accelerated Hierarchical Density Clustering”, 2017 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 33‑42, Nov. 2017.</p>

<p>[2] R. J. G. B. Campello, D. Moulavi, and J. Sander, “Density-Based Clustering Based on Hierarchical Density Estimates”, in Advances in Knowledge Discovery and Data Mining, vol. 7819, J. Pei, V. S. Tseng, L. Cao, H. Motoda, and G. Xu, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 160‑172.</p>

<p>[3] https://nbviewer.jupyter.org/github/scikit-learn-contrib/hdbscan/blob/master/notebooks/Comparing%20Clustering%20Algorithms.ipynb (accessed Nov. 24, 2020).</p>

<p>[4] https://nbviewer.jupyter.org/github/scikit-learn-contrib/hdbscan/blob/master/notebooks/How%20HDBSCAN%20Works.ipynb (accessed Nov. 24, 2020).</p>

<p>[5] https://nbviewer.jupyter.org/github/scikit-learn-contrib/hdbscan/blob/master/notebooks/Benchmarking%20scalability%20of%20clustering%20implementations-v0.7.ipynb (accessed Nov. 24, 2020).</p>

]]></content><author><name>Romain Mathonat</name><email>romain.mathonat@gmail.com</email></author><summary type="html"><![CDATA[Clustering is the task of automatically grouping similar objects: we want to minimize the intra-cluster distance and maximize the distance between clusters (although the exact definitions vary slightly from paper to paper).]]></summary></entry><entry><title type="html">Errors and manipulations in a time of epidemic</title><link href="http://vulgairedev.fr/2021/07/02/covid-stats-erreurs.html" rel="alternate" type="text/html" title="Errors and manipulations in a time of epidemic" /><published>2021-07-02T00:00:00+00:00</published><updated>2021-07-02T00:00:00+00:00</updated><id>http://vulgairedev.fr/2021/07/02/covid-stats-erreurs</id><content type="html" xml:base="http://vulgairedev.fr/2021/07/02/covid-stats-erreurs.html"><![CDATA[<p>This is a highly polarizing topic, so I will try to stick to the substance and shed some light on several errors or manipulations I have seen recently.
In particular, an <a href="https://blogs.mediapart.fr/laurent-mucchielli/blog/300721/la-vaccination-covid-l-epreuve-des-faits-2eme-partie-une-mortalite-inedite">article</a> was recently published on the Mediapart blog platform (it therefore does not commit the editorial staff) (Update: it has since been taken down and republished <a href="https://www.francesoir.fr/opinions-tribunes/la-vaccination-covid-lepreuve-des-faits-2eme-partie-une-mortalite-inedite">here</a>). It was written by Laurent Mucchielli, a CNRS research director in sociology, who is thus speaking outside his field of expertise.
Other authors, apparently from the scientific and research world (in pharmacy, medicine, computer science), co-signed the article. At first glance, one might therefore expect sound and rigorous scientific work. Let us look at the details.</p>

<h2 id="beaucoup-de-malades-sont-vaccinés">“Many of the sick are vaccinated”</h2>
<p>A first argument of this article, repeated in several <a href="https://www.cnews.fr/videos/monde/2021-06-27/israel-40-des-nouveaux-cas-sont-vaccines-1098663">media outlets</a>, is that “the majority of people hospitalized with severe forms are now vaccinated.” This is not an interesting statistic. If everyone is vaccinated,
the proportion of hospitalized people who are vaccinated is 100%, even if only a single person is hospitalized. Can we conclude that the vaccine is not effective? No! This is a classic error called the “base rate fallacy”, which we already discussed <a href="http://vulgairedev.fr/blog/article/resume-statistique">here</a>.</p>
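<p>The base rate effect described above is easy to check numerically. The sketch below uses hypothetical numbers (a 10% hospitalization risk when unvaccinated and a vaccine that cuts that risk by 90% — both assumptions for illustration, not figures from the article):</p>

```python
def share_of_hospitalized_vaccinated(coverage, risk_unvacc=0.10, efficacy=0.90):
    """Fraction of hospitalized people who are vaccinated (toy model)."""
    risk_vacc = risk_unvacc * (1 - efficacy)      # vaccine cuts the risk by `efficacy`
    hosp_vacc = coverage * risk_vacc              # hospitalized AND vaccinated
    hosp_unvacc = (1 - coverage) * risk_unvacc    # hospitalized AND unvaccinated
    return hosp_vacc / (hosp_vacc + hosp_unvacc)

# Even with a highly effective vaccine, the share of vaccinated people
# among the hospitalized climbs to 100% as coverage approaches 100%.
for coverage in (0.5, 0.9, 0.99, 1.0):
    share = share_of_hospitalized_vaccinated(coverage)
    print(f"{coverage:.0%} vaccinated -> {share:.0%} of hospitalized are vaccinated")
```

The vaccine's effectiveness never changes in this loop; only the coverage does, yet the headline statistic goes from “a small minority of the hospitalized are vaccinated” to “all of them are”.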

<p>In plain language, this proportion answers the question “what is the probability of being vaccinated given that one is hospitalized?”, which can be written p(vaccinated|hospitalized).
What we actually care about is rather p(hospitalized|vaccinated), i.e. the risk of being hospitalized given that one is vaccinated, and comparing it
to the probability of being hospitalized given that one is not vaccinated.</p>

<p>We will apply <a href="https://fr.wikipedia.org/wiki/Th%C3%A9or%C3%A8me_de_Bayes">Bayes' theorem</a>.
The <em>prior</em> risk of being hospitalized because of covid can be estimated at <a href="https://www.thelancet.com/action/showFullTableHTML?isHtml=true&amp;tableId=tbl3&amp;pii=S1473-3099%2820%2930243-7">6.8%</a> following the first estimates from the start of the epidemic, or at <a href="https://www.cascoronavirus.fr/">8.5%</a> if we divide
the total number of hospitalizations in France by the total number of detected cases.
Note that the true value is probably lower,
since asymptomatic covid cases go undetected. However, this value p(hospitalized) is the same in the calculation of p(hospitalized|vaccinated) and of p(hospitalized|unvaccinated); since we will take a ratio to compare them, it cancels out and does not affect the result.</p>

<p>Moreover, to be fully rigorous, the notation should specify
that all estimates are made under the hypothesis that one catches covid (we drop it for simplicity).
In that case, we must estimate the probability of being vaccinated given that one has contracted covid. We therefore have to apply Bayes' theorem once more inside our calculation.
We have:</p>

\[p(vaccinated|covid) = \frac{p(vaccinated)}{p(covid)}p(covid|vaccinated)\]

<p>Let us assume covid is virulent enough that the probability of eventually catching it can be taken as 100%. The prior probability of being vaccinated is 60%. Finally, according to the study on <a href="https://www.nejm.org/doi/full/10.1056/nejmoa2034577">the Pfizer vaccine</a>, the vaccine reduces the probability of contracting covid by 95%. For other vaccines it would be less. Let us instead assume the probability of contracting covid when vaccinated is 70%, to try to account for vaccines being less effective against the new variants. We then get a probability of being vaccinated, given that one has covid, of 42%; I will spare you the details of the calculation (<em>in what follows, this value stands in for p(vaccinated), again for simplicity</em>).</p>
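<p>The 42% figure follows directly from Bayes' theorem with the simplified inputs stated above (60% coverage, p(covid) taken as 1, a 70% probability of contracting covid when vaccinated):</p>

```python
# Simplified inputs from the text above.
p_vaccinated = 0.60              # prior vaccination coverage
p_covid = 1.0                    # everyone is assumed to eventually catch covid
p_covid_given_vaccinated = 0.70  # lowered to account for new variants

# Bayes' theorem: p(vaccinated|covid) = p(vaccinated) / p(covid) * p(covid|vaccinated)
p_vaccinated_given_covid = p_vaccinated / p_covid * p_covid_given_vaccinated
print(f"p(vaccinated|covid) = {p_vaccinated_given_covid:.0%}")  # 42%
```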

<p>Finally, in France, p(unvaccinated|hospitalized) = 85% (<a href="https://www.lexpress.fr/actualite/societe/sante/covid-19-en-france-85-des-hospitalises-ne-sont-pas-vaccinees_2155849.html">source</a>).</p>

<p>Let us now compute the probability of being hospitalized given that one is vaccinated, using the French data.</p>

\[p(hospitalized|vaccinated) = \frac{p(hospitalized)}{p(vaccinated)}p(vaccinated|hospitalized)\]

\[= \frac{0.085}{0.42}*0.15\]

\[= 3.0\%\]

<p>Applying the same reasoning to compute the probability of being hospitalized given that one is not vaccinated, we obtain:</p>

\[p(hospitalized|unvaccinated) = \frac{p(hospitalized)}{p(unvaccinated)}p(unvaccinated|hospitalized)\]

\[= \frac{0.085}{0.58}*0.85\]

\[= 12.4\%\]

<p><strong>Beware</strong>: once again, this estimate assumes an 8.5% probability of being hospitalized if one contracts covid. That probability is debatable, but it does not change
the following ratio:</p>

<p><strong>In France, currently, one is 4 times more likely (12.4 / 3.0) to be hospitalized if one is not vaccinated, under the hypothesis that one contracts covid.</strong></p>
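<p>The whole calculation fits in a few lines. The inputs are the ones used in the article (8.5% hospitalization risk, 42% simplified p(vaccinated), and the 15% / 85% hospital split):</p>

```python
# Inputs from the article's calculation.
p_hosp = 0.085                              # p(hospitalized) if one contracts covid
p_vacc, p_unvacc = 0.42, 0.58               # simplified p(vaccinated), p(unvaccinated)
p_vacc_given_hosp = 0.15                    # p(vaccinated|hospitalized)
p_unvacc_given_hosp = 0.85                  # p(unvaccinated|hospitalized)

# Bayes' theorem, applied twice.
p_hosp_given_vacc = p_hosp / p_vacc * p_vacc_given_hosp          # ~3.0%
p_hosp_given_unvacc = p_hosp / p_unvacc * p_unvacc_given_hosp    # ~12.4%

ratio = p_hosp_given_unvacc / p_hosp_given_vacc                  # ~4
print(f"p(hospitalized|vaccinated)   = {p_hosp_given_vacc:.4f}")
print(f"p(hospitalized|unvaccinated) = {p_hosp_given_unvacc:.4f}")
print(f"risk ratio                   = {ratio:.1f}")
```

Note that `p_hosp` cancels in `ratio`, which is why the debatable 8.5% figure does not affect the final comparison.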

<p><strong>Important remark</strong>: there is at least one additional bias in these data: the vaccine was given first to the most vulnerable people. We are thus comparing a more fragile vaccinated population (age, comorbidities) to a more resistant unvaccinated one, which tends to drag the vaccine's apparent “results” down.</p>

<p>This is why we run experimental studies, taking two sufficiently large groups of people. Randomization and group size ensure the groups are comparable with respect to other variables that would otherwise influence the results (for example age, which increases mortality).</p>

<p>Such studies exist, since they are required to establish, objectively,
the efficacy and safety of a vaccine. For example, in the publication on the <a href="https://www.nejm.org/doi/full/10.1056/nejmoa2034577">Pfizer</a> vaccine, two random groups of more than 21,000 people each were formed: one group received the vaccine, the other a placebo. The numbers of people who contracted covid in each group were then compared
(8 in the first, 162 in the other), which lets us estimate (with a statistical test) that the vaccine was 95% protective against covid at the time of the study.</p>
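<p>The headline efficacy figure can be reproduced from those counts. The arm sizes below are approximations of the trial's roughly equal groups (an assumption for this sketch; the published estimate also involves exposure time and confidence intervals, which are omitted here):</p>

```python
# Approximate figures: ~21,700 participants per arm (assumed roughly equal),
# 8 covid cases in the vaccine arm vs 162 in the placebo arm.
n_vaccine, n_placebo = 21_720, 21_728
cases_vaccine, cases_placebo = 8, 162

attack_vaccine = cases_vaccine / n_vaccine    # attack rate, vaccine arm
attack_placebo = cases_placebo / n_placebo    # attack rate, placebo arm

# Vaccine efficacy = 1 - ratio of attack rates.
efficacy = 1 - attack_vaccine / attack_placebo
print(f"estimated vaccine efficacy: {efficacy:.1%}")  # ~95%
```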

<p>Why, then, do people seem to say the vaccine does not prevent catching covid? Several hypotheses are possible,
such as the virus having mutated, or the aforementioned bias of giving the vaccine to a more vulnerable population first, which lowers its score, among others. I will not risk saying more; this is not my field of expertise. In any case, keep in mind that the current data give us the following estimate:
in France today, one is 4 times more likely to be hospitalized because of covid if one is not vaccinated than if one is.</p>

<h2 id="confondre-causalité-et-corrélation">Confusing causation and correlation</h2>
<p>The article cites two researchers who are also co-signatories: Emanuelle Darles and Vincent Pavant. In <a href="https://crowdbunker.com/v/nen8o1aI">this video</a>, Mr Pavant takes a model (which may look complex at first glance, and even inadequate at second glance) and fits it to a mortality curve,
of which he keeps only half, then generates new data that suit him, in an attempt to demonstrate the relevance of his model. He concludes that “the link between vaccination and mortality is certain”.
This reasoning is wrong. You cannot take a curve, mark the start date of vaccination on it and say “the number of deaths rises after vaccination begins, therefore vaccination kills people” (which is ultimately what he and the other speakers do).
This phenomenon is known as “spurious correlation”, and there is a website that <a href="https://www.tylervigen.com/spurious-correlations">catalogs them</a>.</p>

<p>Without going into rigorous mathematical formalism, we say that two variables are <strong>correlated</strong> when they vary in the same way.</p>

<p>For example, in humans, height is fairly well correlated with weight: the taller you are, the heavier you tend to be, and vice versa.
Causation, on the other hand, means that one variable causes/influences another. For example, the amount of alcohol I drink causes an increase in my blood alcohol level. Identifying causation can be a very hard problem, one that many researchers work on.</p>

<p>For example, here we can see that mozzarella consumption is correlated with the number of civil engineering doctorates awarded in the United States.</p>

<p><img src="/assets/images/chart.png" alt="" style="display:block; margin-left:auto; margin-right:auto" /></p>

<p>Is there a link between these two variables? Probably not. But because of the randomness of our world, correlations can appear by “luck”, without any causation.
Likewise, two variables can be correlated without one causing the other, because some hidden phenomenon influences both of them.</p>

<p>For example (borrowed from the book <a href="https://livre.fnac.com/a8928388/Bruce-Benamran-Prenez-le-temps-d-e-penser">Prenez le temps d’e-penser</a>, B. Benamran), people who go to bed with their shoes on have a headache the next day. Does sleeping with one's shoes on cause the headache?
No! There is a hidden variable: “people who drink too much fall asleep with their shoes on”. That hidden variable caused both the shoes-on sleep and the headache.</p>
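<p>This confounder effect can also be simulated. In the toy model below, “drinking” causes both the shoes and the headache, while shoes and headaches are otherwise completely unrelated (all the probabilities are made up for the illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# The hidden variable: 30% of people drank too much.
drank = rng.random(n) < 0.3
# Drinking causes both effects; neither effect causes the other.
shoes = drank & (rng.random(n) < 0.8)     # fell asleep with shoes on
headache = drank & (rng.random(n) < 0.9)  # headache the next day

corr = np.corrcoef(shoes, headache)[0, 1]
print(f"correlation(shoes, headache) = {corr:.2f}")
```

The correlation comes out strongly positive even though, by construction, shoes never cause headaches: the hidden variable drives both.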

<p>Back to our epidemic. We observe that from the moment vaccination begins, mortality rises. Can we conclude that the vaccine causes death? No. Most likely, what is happening is that the epidemic is surging again, so we vaccinate to prevent covid deaths.
The vaccine prevents deaths, but the epidemic still causes some. Here the hidden variable, which causes both the vaccination and the rise in deaths, is simply the epidemic. Note that, to be rigorous, this hypothesis should be validated experimentally.
Conveniently (if one can speak that way given the situation), we already have such data, since some countries vaccinated heavily while others did not (<a href="https://twitter.com/nathanpsmad/status/1416732064020369412?s=19&amp;fbclid=IwAR1KZ-sJYoMZi1FtZwClxUS1fP3qITtyX2xOIk82QsDGIrTUkOgpX1i5xhA">source</a>):</p>

<p><img src="/assets/images/comparer_pays.jpeg" alt="" style="display:block; margin-left:auto; margin-right:auto" /></p>

<p><strong>Here we do have two large, essentially random groups, which gives us a good idea of the vaccine's influence on mortality. In the vaccinated one there are very few deaths; in the unvaccinated one, many more.</strong>
This is not proof in itself: to be rigorous, the experiment would need to take place in the same country, with the same climate, etc., to rule out confounding factors (biases), but it is still very encouraging.</p>

<h2 id="autre-remarques-diverses">Other miscellaneous remarks</h2>
<ul>
  <li>Mr Mucchielli's article asserts that the benefit-risk balance for young people is very bad. If
we check the source that is cited, we find that it compares the risks of covid for young people
against the risk of the vaccine in the general population. Different things are being compared, so the <strong>conclusions
are wrong</strong>. In the reports they <a href="https://ansm.sante.fr/uploads/2021/07/16/20210716-vaccins-covid-19-rapport-moderna-periode-28-05-2021-01-07-2021.pdf">cite</a>, there is, for example, only a single serious case for ages 0-15, and the median age of death among the vaccinated is 76.2 years…</li>
  <li>The article assumes that deaths occurring after vaccination are caused by the vaccine. Yet the CRPV reports on Moderna clearly state: “this monthly report therefore presents only the adverse events for which the role of the vaccine is confirmed or suspected”. Once again, <strong>establishing causation is hard</strong>. The conclusions of the reports clearly say
that there is no certainty that the vaccine causes the deaths. Such claims call for caution.
If we feed a banana to 100,000 people, a few dozen will probably have adverse events, and some will die. Should we conclude that bananas cause death? No! In the present case, I am not competent in pharmacology to judge, so I defer to the experts' publications, which conclude that it does not. What I can say, however, is that the article is, at best, mistaken and, at worst, manipulating the data.</li>
  <li>The article deliberately presents scary percentages. For Pfizer, for instance, the data are given in absolute numbers, and then a percentage suddenly appears: 27.7%. That proportion sticks in the mind; read too quickly, it suggests severe forms are very common, when it is only the proportion of serious events among the adverse ones. In fact, if we compute the number of deaths over all Pfizer injections, for example, we find 0.0018%.</li>
  <li>If we count every serious case for the vaccine, then we should also count every serious case for covid (hospitalization, long covid, etc.), otherwise we are not comparing the same thing. Or we compare the numbers of deaths, in which case the figures are far more reasonable.</li>
  <li>Not a single citation of all the scientific literature that contradicts the author(s). That is rather disturbing for an article that claims to want to “look coldly at the data”, and that denounces an “<strong>ideology</strong> of universal vaccination”.</li>
  <li>This article is not a peer-reviewed scientific article. The errors pointed out here (among others) would have prevented such a publication in any serious journal or conference.</li>
</ul>

<h2 id="conclusion">Conclusion</h2>
<p><strong>Isolating causation is a hard problem</strong>; it is one of the reasons people spend their lives doing research. There are several reflexes worth having when we are presented with ready-made figures and conclusions:
Who is speaking? Are these people speaking within their field of expertise? Are we dealing with a peer-reviewed scientific article? Where was it published?
What does the proportion/statistic we are shown actually correspond to? Above all, beware of correlations, which are not necessarily causations. In particular,
<strong>when someone shows a graph and concludes “because you can see it”, think hard about what lies behind it</strong>. Is it an experimental study taking two large random groups to truly study
the impact of a single variable, or is it observational data (i.e. observations without a predefined experimental design, where the data-generating process is not controlled), which can therefore carry biases? For a visual, well-popularized explanation, see <a href="https://www.youtube.com/watch?v=aOX0pIwBCvw">here</a>.</p>

<p>We also estimated here that an average person in the French population who catches covid today is <strong>four times more likely to be hospitalized if they are not vaccinated</strong>.
Finally, the comparative data between England and Tunisia do seem to confirm that <strong>vaccination protects against the risk of dying from covid</strong>.</p>

<p><em>NB</em>: Thanks to <a href="https://scholar.google.fr/citations?user=GmIYR0UAAAAJ&amp;hl=en">Anes Bendimerad</a>, <a href="https://scholar.google.com/citations?user=l8NPFGcAAAAJ&amp;hl=fr">Aurélie Gabriel</a> and <a href="https://www.linkedin.com/in/nicolas-nativel-42b2a5b4/">Nicolas Nativel</a> for their proofreading and feedback.</p>]]></content><author><name>Romain Mathonat</name><email>romain.mathonat@gmail.com</email></author><summary type="html"><![CDATA[This is a highly polarizing topic, so I will try to stick to the substance and shed some light on several errors or manipulations I have seen recently. In particular, an article was recently published on the Mediapart blog platform (it therefore does not commit the editorial staff) (Update: it has since been taken down and republished here). It was written by Laurent Mucchielli, a CNRS research director in sociology, who is thus speaking outside his field of expertise. Other authors, apparently from the scientific and research world (in pharmacy, medicine, computer science), co-signed the article. At first glance, one might therefore expect sound and rigorous scientific work. Let us look at the details.]]></summary></entry><entry><title type="html">Analysis of the COVID-19 spread as of 14/03/20</title><link href="http://vulgairedev.fr/2020/03/14/analyse-coronavirus.html" rel="alternate" type="text/html" title="Analysis of the COVID-19 spread as of 14/03/20" /><published>2020-03-14T00:00:00+00:00</published><updated>2020-03-14T00:00:00+00:00</updated><id>http://vulgairedev.fr/2020/03/14/analyse-coronavirus</id><content type="html" xml:base="http://vulgairedev.fr/2020/03/14/analyse-coronavirus.html"><![CDATA[<h3 id="attention">Warning</h3>
<p>I am not an epidemiologist, but a PhD student in data science. After reading <a href="https://medium.com/@tomaspueyo/coronavirus-act-today-or-people-will-die-f4d3d9cd99ca">Thomas Pueyo's excellent article</a>, I wanted to build on his work, with open code, for the case of France, in French, as of March 14, 2020. This work is NOT a peer-reviewed scientific article, but an estimation attempt put together in a few hours. Moreover, even though the model used fits the observations very well, it may be somewhat too simplistic. Take it with that in mind.</p>

<h2 id="loi-exponentielle">Exponential growth</h2>

<p>The spread of COVID-19 can be modeled by an exponential function. The Wikipedia article on <a href="https://en.wikipedia.org/wiki/Exponential_growth">exponential growth</a> is well written: one can fairly easily prove that exponential growth can be expressed in several equivalent forms:
\(x(t) = x_0e^{kt} = x_0e^{t/\tau} = x_02^{t/T} = x_0(1+\frac{r}{100})^{t/p}\)
where x(t) is the number of cases at time t, x0 the number of cases at t=0, k the effective growth rate, tau the e-folding time, r the intrinsic growth rate over a period p, and T the doubling time (the time it takes for x(t) to double).</p>
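<p>The equivalence of these forms is easy to check numerically: setting T = ln(2)/k, the form x0·exp(k·t) matches x0·2^(t/T) for every t (the x0 and k values below are the ones fitted further down in this article):</p>

```python
import numpy as np

x0, k = 0.017, 0.241          # values fitted later in this article
T = np.log(2) / k             # doubling time implied by k
t = np.linspace(0, 50, 6)     # a few arbitrary time points, in days

# Both parameterizations describe the exact same curve.
assert np.allclose(x0 * np.exp(k * t), x0 * 2.0 ** (t / T))
print(f"doubling time T = {T:.2f} days")
```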

<h2 id="modéliser-la-propagation-en-france-au-140320">Modeling the spread in France (as of 14/03/20)</h2>
<p>I use data extracted from the <a href="https://www.worldometers.info/coronavirus/">Worldometer</a>. Let us first visualize the evolution of the number of cases:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="kn">from</span> <span class="nn">scipy.optimize</span> <span class="kn">import</span> <span class="n">curve_fit</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="n">sns</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">rc</span><span class="o">=</span><span class="p">{</span><span class="s">'figure.figsize'</span><span class="p">:(</span><span class="mf">11.7</span><span class="p">,</span><span class="mf">8.27</span><span class="p">)})</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'covid_confirmed.csv'</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># we extract only the timeserie for france
</span><span class="n">france_df</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="n">data</span><span class="p">[</span><span class="s">'Province/State'</span><span class="p">]</span> <span class="o">==</span> <span class="s">'France'</span><span class="p">].</span><span class="n">iloc</span><span class="p">[:,</span> <span class="mi">4</span><span class="p">:]</span>
<span class="n">france_df</span> <span class="o">=</span> <span class="n">france_df</span><span class="p">.</span><span class="n">T</span>
<span class="n">france_df</span> <span class="o">=</span> <span class="n">france_df</span><span class="p">.</span><span class="n">reset_index</span><span class="p">()</span>
<span class="n">france_df</span><span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="n">france_df</span><span class="p">.</span><span class="n">columns</span><span class="p">[</span><span class="mi">0</span><span class="p">]:</span> <span class="s">"Time"</span><span class="p">,</span> <span class="n">france_df</span><span class="p">.</span><span class="n">columns</span><span class="p">[</span><span class="mi">1</span><span class="p">]:</span> <span class="s">"Cases Number"</span> <span class="p">},</span> <span class="n">inplace</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="n">france_df</span><span class="p">[</span><span class="s">'Time'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">pd</span><span class="p">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">france_df</span><span class="p">[</span><span class="s">'Time'</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">strip</span><span class="p">(),</span> <span class="nb">format</span><span class="o">=</span><span class="s">'%m/%d/%y'</span><span class="p">))</span>
<span class="n">france_df</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>

<div>
<style scoped="">
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Time</th>
      <th>Cases Number</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>2020-01-22</td>
      <td>0</td>
    </tr>
    <tr>
      <th>1</th>
      <td>2020-01-23</td>
      <td>0</td>
    </tr>
    <tr>
      <th>2</th>
      <td>2020-01-24</td>
      <td>2</td>
    </tr>
    <tr>
      <th>3</th>
      <td>2020-01-25</td>
      <td>3</td>
    </tr>
    <tr>
      <th>4</th>
      <td>2020-01-26</td>
      <td>3</td>
    </tr>
  </tbody>
</table>
</div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sns</span><span class="p">.</span><span class="n">lineplot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">france_df</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="s">'Time'</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s">'Cases Number'</span><span class="p">)</span>
</code></pre></div></div>


<p><img src="/assets/images/analyse_coronavirus_4_1.png" alt="" style="display:block; margin-left:auto; margin-right:auto" /></p>

<p>Now let us build our exponential growth model and choose its parameters so that it fits the data as closely as possible.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x_numpy</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">france_df</span><span class="p">[</span><span class="s">'Time'</span><span class="p">])</span>

<span class="c1"># we transform date to int to fit the model
</span><span class="n">x_numpy</span> <span class="o">=</span> <span class="p">(</span><span class="n">x_numpy</span> <span class="o">-</span> <span class="n">x_numpy</span><span class="p">[</span><span class="mi">0</span><span class="p">]).</span><span class="n">astype</span><span class="p">(</span><span class="s">'timedelta64[D]'</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="s">'int'</span><span class="p">)</span>
<span class="n">y_numpy</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">france_df</span><span class="p">[</span><span class="s">'Cases Number'</span><span class="p">])</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">exp_func</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">a</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">b</span> <span class="o">*</span> <span class="n">x</span><span class="p">)</span> 

<span class="n">popt</span><span class="p">,</span> <span class="n">pcov</span> <span class="o">=</span> <span class="n">curve_fit</span><span class="p">(</span><span class="n">exp_func</span><span class="p">,</span> <span class="n">x_numpy</span><span class="p">,</span> <span class="n">y_numpy</span><span class="p">,</span> <span class="n">p0</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mf">1e-6</span><span class="p">))</span>



<span class="n">residuals</span> <span class="o">=</span> <span class="n">y_numpy</span> <span class="o">-</span> <span class="n">exp_func</span><span class="p">(</span><span class="n">x_numpy</span><span class="p">,</span> <span class="o">*</span><span class="n">popt</span><span class="p">)</span>
<span class="n">ss_res</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">residuals</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span>

<span class="n">ss_tot</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">((</span><span class="n">y_numpy</span><span class="o">-</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">y_numpy</span><span class="p">))</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span>
<span class="n">r_squared</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">-</span> <span class="p">(</span><span class="n">ss_res</span> <span class="o">/</span> <span class="n">ss_tot</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"R² = </span><span class="si">{</span><span class="n">r_squared</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x_numpy</span><span class="p">,</span> <span class="n">y_numpy</span><span class="p">,</span> <span class="s">'ko'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"Données originales"</span><span class="p">)</span>
<span class="n">label</span> <span class="o">=</span> <span class="s">"{:.3f} * exp({:.3f}*x)"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="o">*</span><span class="n">popt</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x_numpy</span><span class="p">,</span> <span class="n">exp_func</span><span class="p">(</span><span class="n">x_numpy</span><span class="p">,</span> <span class="o">*</span><span class="n">popt</span><span class="p">),</span> <span class="s">'b-'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="n">label</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>

</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>R² = 0.9976140854576636
</code></pre></div></div>

<p><img src="/assets/images/analyse_coronavirus_7_1.png" alt="" style="display:block; margin-left:auto; margin-right:auto" /></p>

<p>Our model therefore gives x0 = 0.017 and k = 0.241.
It is simple, and of course open to criticism, but it nevertheless seems to match the current reality well, with an R² of 0.997 (at 1 it would fit the data perfectly, so it models the current observations very well).</p>

<p><strong>Note</strong>: The exponential growth will of course not go on forever. It will be limited either by the measures we take, or once the pool of non-infected, non-immunized people becomes too small. We saw it in China: the lockdown measures changed the model and stopped the exponential growth. In a little while, the measures taken here will bend the curve, as they will have prevented further spread.</p>

<p>From the equations at the beginning, one can easily show that:</p>

\[k = \frac{ln(1 + \frac{r}{100})}{p} = \frac{ln(2)}{T}\]

<p>T is the time it takes for the number of cases to double. Indeed, growing by r% every p days means exp(kp) = 1 + r/100, while doubling every T days means exp(kT) = 2; taking logarithms gives the relation above. Here T is:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">k</span> <span class="o">=</span> <span class="n">popt</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"T = </span><span class="si">{</span><span class="n">np</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span> <span class="o">/</span> <span class="n">k</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>T = 2.880887901598439
</code></pre></div></div>

<p>The number of detected cases in France therefore doubles every <strong>2.9</strong> days, for now.<br />
Let us set p to 1 to get the intrinsic daily growth rate r:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">r</span> <span class="o">=</span> <span class="mi">100</span> <span class="o">*</span> <span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">k</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"r = </span><span class="si">{</span><span class="n">r</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>r = 27.201457983622014
</code></pre></div></div>

<p>This means that every day, the number of detected cases increases by <strong>27%</strong>.</p>
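
<p>As a consistency check, the doubling time and the daily growth rate are two views of the same fitted k; a quick sketch using the rounded k = 0.241 from above (so the figures differ very slightly from the exact ones printed earlier):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

k = 0.241                    # growth rate fitted above (rounded)
T = np.log(2) / k            # doubling time, about 2.88 days
r = 100 * (np.exp(k) - 1)    # daily growth in percent, about 27.3

# Growing by r percent per day for T days must exactly double the count:
assert round((1 + r / 100) ** T, 9) == 2.0
print(T, r)
</code></pre></div></div>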

<h2 id="regarder-ailleurs-pour-mieux-prévoir-ici">Looking elsewhere to better forecast here</h2>

<p>We can consider that up to today (14/03/20) the measures taken by the government are not yet effective (schools and universities only close from 16/03/20). The same goes for the work-from-home recommendations, since the government’s announcement was made on Thursday evening: company employees met on Friday to decide how best to adapt to the situation. We can therefore reasonably assume that the curve will keep the same momentum over the next few days.</p>

<h3 id="une-latence-pour-limpact-des-mesures-prises">A latency before the measures take effect</h3>
<p>The incubation period is estimated at about 5 days, based on these <a href="https://github.com/midas-network/COVID-19/tree/master/parameter_estimates/2019_novel_coronavirus">publications</a> (incubation-period section; I only considered data from peer-reviewed publications). This means people only start showing symptoms after 5 days. Therefore, any containment measure will only start to have an impact after this delay, at best: even if you confine people at home, those who have already caught the virus will still be sick about 5 days later. Moreover, the first symptoms do not send patients to hospital instantly; it is a rather continuous process, so this latency is actually longer than 5 days. For China it took about 11-12 days after the Wuhan shutdown (see chart 11 <a href="https://medium.com/@tomaspueyo/coronavirus-act-today-or-people-will-die-f4d3d9cd99ca">here</a>).</p>

<p>This latency is what “hurts” in Italy in particular, where lockdown measures were really only taken on 09/03: the spread keeps its momentum, the number of cases increases drastically, and with it the number of deaths, leaving the impression of being powerless against the scale of the phenomenon. The lockdown measures will slow the spread down; the latency period simply has to be ridden out first.</p>

<h3 id="prédire-le-nombre-de-cas-détectés-les-prochains-jours">Predicting the number of detected cases over the next few days.</h3>
<p>Given this latency, the number of infections will probably keep growing in a similar way. We can therefore use our model to give a rough estimate of the number of cases. Let us do it over about 12 days, i.e. the time it took in China for the measures to have an impact.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X_test</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">52</span><span class="p">,</span><span class="mi">65</span><span class="p">))</span>
<span class="n">y_pred</span> <span class="o">=</span> <span class="n">exp_func</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="o">*</span><span class="n">popt</span><span class="p">)</span>

<span class="n">day_of_month</span> <span class="o">=</span> <span class="mi">14</span>

<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">value</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">y_pred</span><span class="p">):</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Le </span><span class="si">{</span><span class="n">day_of_month</span> <span class="o">+</span> <span class="n">i</span><span class="si">}</span><span class="s"> mars on peut prédire un nombre de cas diagnostiqués d'environ </span><span class="si">{</span><span class="nb">int</span><span class="p">(</span><span class="n">value</span><span class="p">)</span><span class="si">}</span><span class="s">."</span><span class="p">)</span>

</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Le 14 mars on peut prédire un nombre de cas diagnostiqués d'environ 4597.
Le 15 mars on peut prédire un nombre de cas diagnostiqués d'environ 5848.
Le 16 mars on peut prédire un nombre de cas diagnostiqués d'environ 7439.
Le 17 mars on peut prédire un nombre de cas diagnostiqués d'environ 9463.
Le 18 mars on peut prédire un nombre de cas diagnostiqués d'environ 12037.
Le 19 mars on peut prédire un nombre de cas diagnostiqués d'environ 15311.
Le 20 mars on peut prédire un nombre de cas diagnostiqués d'environ 19476.
Le 21 mars on peut prédire un nombre de cas diagnostiqués d'environ 24774.
Le 22 mars on peut prédire un nombre de cas diagnostiqués d'environ 31513.
Le 23 mars on peut prédire un nombre de cas diagnostiqués d'environ 40085.
Le 24 mars on peut prédire un nombre de cas diagnostiqués d'environ 50989.
Le 25 mars on peut prédire un nombre de cas diagnostiqués d'environ 64859.
Le 26 mars on peut prédire un nombre de cas diagnostiqués d'environ 82501.
</code></pre></div></div>

<h3 id="estimer-le-nombre-de-cas-rééls-actuels">Estimating the actual number of current cases</h3>
<p>Obviously, given the limited number of people who can be tested, and the fact that most people fall ill without necessarily knowing they have COVID-19, the number of people actually sick is far higher. We can redo the analysis of <a href="https://medium.com/@tomaspueyo/coronavirus-act-today-or-people-will-die-f4d3d9cd99ca">Thomas Pueyo</a> with current data to estimate the actual number of cases out there. The average delay between infection and death is about 17 days (see <a href="https://docs.google.com/spreadsheets/d/17YyCmjb2Z2QwMiRRwAb7W0vQoEAiL9Co0ARsl03dSlw/copy?usp=sharing">here</a>).</p>

<p>Under good care conditions such as in France (for now), mortality is probably at least 2% (number of deaths / number of infected people). “At least” because infected people can still die before their cases are “closed”. Thus, for x deaths today, we can estimate <strong>about x * 100 / 2 = 4550 people were sick 17 days ago. Today we have 4499 declared cases.</strong> This estimate of the case count 17 days ago therefore looks plausible given current data. Since the government’s measures had no impact between 17 days ago and now, we can apply our model’s growth rate to estimate the number of sick people today (we thus assume that the proportion of sick people getting tested is constant). Since there are currently <strong>91</strong> recorded deaths, our estimate of the actual number of current cases is:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">died</span> <span class="o">=</span> <span class="mi">91</span>
<span class="n">x_17_before</span> <span class="o">=</span> <span class="n">died</span> <span class="o">*</span> <span class="mi">100</span> <span class="o">/</span> <span class="mi">2</span>

<span class="n">estimation_real_cases</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">x_17_before</span><span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="nb">float</span><span class="p">(</span><span class="n">r</span><span class="p">)</span> <span class="o">/</span> <span class="mi">100</span><span class="p">)</span> <span class="o">**</span> <span class="mi">17</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Aujourd'hui nous avons donc environ </span><span class="si">{</span><span class="n">estimation_real_cases</span><span class="si">}</span><span class="s"> nombre de cas."</span><span class="p">)</span>

</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Aujourd'hui nous avons donc environ 271879 nombre de cas.
</code></pre></div></div>

<p>The number of people currently infected by the virus depends on the mortality rate, but this estimate tells us
that it is probably already in the <strong>hundreds of thousands</strong>.</p>

<p>Note that the mortality estimate can climb to 4-5% when the resources available for patients are insufficient (see Hubei).</p>
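
<p>The estimate above is quite sensitive to this assumed fatality rate; a quick sketch of that sensitivity, reusing the 91 deaths, the roughly 27.2% daily growth rate, and the 17-day delay from this post:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>deaths = 91    # deaths reported in France on 14/03/20
r = 27.2       # daily growth rate in percent, fitted above
delay = 17     # average days between infection and death

for cfr in [2, 3, 4, 5]:   # assumed case fatality rate in percent
    cases_17_days_ago = deaths * 100 / cfr
    cases_today = int(cases_17_days_ago * (1 + r / 100) ** delay)
    print(f"CFR {cfr}%: about {cases_today} current cases")
</code></pre></div></div>

<p>A higher assumed fatality rate lowers the estimated case count, since each death then represents fewer infections; at 2% this reproduces the figure of roughly 270,000 cases computed above.</p>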

<h2 id="conclusion">Conclusion</h2>
<p><strong>The situation must not be taken lightly.</strong> 
<strong>Limiting the spread is a matter of life and death for vulnerable people</strong>. Preventing hospitals from being overwhelmed is surely the most important factor in limiting the number of deaths.</p>

<p><strong>Advice</strong>: Clean your phones. A very large share of the population uses their phone compulsively all day long. Using it, washing your hands, then using it again leads to contamination: clean your phones! And stay home, as much as possible.</p>

<p><strong>Update, 15/03/20</strong>: Let us stick our necks out a little more. This Sunday 15 March, the municipal elections were maintained. What is more, it seems that many people wanted to “celebrate” the bars’ last evening of opening this Saturday. It is therefore reasonable to expect that, before any slowdown, there will be a super-peak of cases in about 12-15 days due to these two events. This will be problematic, since we will probably be approaching 80k diagnosed cases; the population may start to take the isolation badly, and above all to feel that the measures are useless, because of this delay. People need to understand this latency phenomenon in order to react better.</p>]]></content><author><name>Romain Mathonat</name><email>romain.mathonat@gmail.com</email></author><summary type="html"><![CDATA[Warning: I am not an epidemiologist, but a PhD student in data science. After reading the excellent article by Thomas Pueyo, I wanted to build on his work, with open code, for the case of France, in French, as of 14 March 2020. This work is NOT a peer-reviewed scientific article, but an estimation attempt made in a few hours. Moreover, even though the model fits the observations very well, it may be a bit too simplistic. Keep that in mind.]]></summary></entry><entry><title type="html">Applied Data Science: Subgroup Discovery on Mushrooms</title><link href="http://vulgairedev.fr/2019/10/16/mushrooms.html" rel="alternate" type="text/html" title="Applied Data Science: Subgroup Discovery on Mushrooms" /><published>2019-10-16T00:00:00+00:00</published><updated>2019-10-16T00:00:00+00:00</updated><id>http://vulgairedev.fr/2019/10/16/mushrooms</id><content type="html" xml:base="http://vulgairedev.fr/2019/10/16/mushrooms.html"><![CDATA[<p>My last publication was on Subgroup Discovery for Sequences (you can access it freely <a href="https://www.researchgate.net/publication/336315710_SeqScout_Using_a_Bandit_Model_to_Discover_Interesting_Subgroups_in_Labeled_Sequences">here</a>). However, in the Data Science community, many people are not aware of what “Subgroup Discovery” or “Pattern Mining” is. So let’s see, on a quick practical example, how to use it: knowing whether mushrooms are poisonous.</p>

<h2 id="what-is-subgroup-discovery-">What is Subgroup Discovery ?</h2>
<p>Subgroup Discovery, Emerging Patterns, Contrast Sets, and Discriminative Pattern Mining all refer to the same idea: finding patterns that are discriminative of a target class. In other words, the aim is to find predictive <strong>interpretable</strong> rules for a class. 
As an example, Herrera et al. used Subgroup Discovery in the context of a <a href="https://www.researchgate.net/profile/Luis_Jimenez-Trevino/publication/220176495_Evolutionary_fuzzy_rule_extraction_for_subgroup_discovery_in_a_psychiatric_emergency_department/links/0f317530f622867ad7000000/Evolutionary-fuzzy-rule-extraction-for-subgroup-discovery-in-a-psychiatric-emergency-department.pdf">psychiatric emergency department</a>. They found rules like:</p>
<ul>
  <li>If Sex=Male and DAY=Monday -&gt; Suicide</li>
  <li>If Sex=Female and (DAY=SUNDAY or DAY=MONDAY) and TIME=LATE_EVENING -&gt; Suicide</li>
</ul>

<p>Of course these rules are not correct 100% of the time, but they tell you that <strong>when a pattern appears, the class is more likely to appear too</strong>.</p>

<p>This is interesting for two reasons:</p>
<ul>
  <li><strong>Understanding your data</strong> in a way that is interpretable by an expert.</li>
  <li>Using those patterns to <strong>improve classification</strong> or regression algorithms. Indeed, as they are discriminative of a target class, you can use them as features to improve classical supervised learning.</li>
</ul>

<p>Subgroup discovery can then be used to improve your system, <em>thanks to interpretability</em>: knowing that people have more suicidal thoughts on Sunday and Monday, particularly at night, you can, for example, schedule more psychological support workers in the department during those periods.</p>

<h2 id="lets-try-it-on-mushroom">Let’s try it on Mushroom</h2>
<p><a href="https://archive.ics.uci.edu/ml/datasets/mushroom">Mushroom</a> is a famous dataset containing characteristics of different species of mushrooms: their odor, color, habitat, etc. More importantly, it also records whether each mushroom is edible or not.</p>

<p>This will be our target class: when using a subgroup discovery algorithm, we consider a dataset and a target class, and the algorithm returns a set of rules discriminative of this class. Here, we are looking for <em>patterns</em> discriminative of Poisonous mushrooms. In other words, we want to find the <strong>conjunction of features that are characteristics of poisonous mushrooms</strong>.</p>

<p>First, let’s install <a href="https://github.com/flemmerich/pysubgroup">pysubgroup</a> package, which is an implementation of several subgroup discovery algorithms, in Python:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>pysubgroup
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pysubgroup</span> <span class="k">as</span> <span class="n">ps</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
</code></pre></div></div>

<p>Let’s take a look at the dataset.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">"./mushroom.csv"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="n">describe</span><span class="p">())</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        EDIBLE CAP-SHAPE CAP-SURFACE CAP-COLOR BRUISES  ODOR GILL-ATTACHEMENT  \
count     8417      8416        8416      8416    8416  8416             8416   
unique       3         6           4        10       2     9                2   
top     EDIBLE    CONVEX       SCALY     BROWN      NO  NONE             FREE   
freq      4488      3796        3268      2320    5040  3808             8200   

       GILL-SPACING GILL-SIZE GILL-COLOR  ... STALK-SURFACE-BELOW-RING  \
count          8416      8416       8416  ...                     8416   
unique            2         2         12  ...                        4   
top           CLOSE     BROAD       BUFF  ...                   SMOOTH   
freq           6824      5880       1728  ...                     5076   

       STALK-COLOR-ABOVE-RING STALK-COLOR-BELOW-RING VEIL-TYPE VEIL-COLOR  \
count                    8416                   8416      8416       8416   
unique                      9                      9         1          4   
top                     WHITE                  WHITE   PARTIAL      WHITE   
freq                     4744                   4640      8416       8216   

       RING-NUMBER RING-TYPE SPORE-PRINT-COLOR POPULATION HABITAT  
count         8416      8416              8416       8416    8416  
unique           3         5                 9          6       7  
top            ONE   PENDANT             WHITE    SEVERAL   WOODS  
freq          7768      3968              2424       4064    3160  

[4 rows x 23 columns]
</code></pre></div></div>

<p>We now have to specify the target class: in our case, it is the column ‘EDIBLE’ when it takes the value ‘POISONOUS’. Note that we must remove this column from the data (the set of features), otherwise it will be treated as a feature, resulting in rules like POISONOUS -&gt; POISONOUS.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">target</span> <span class="o">=</span> <span class="n">ps</span><span class="p">.</span><span class="n">BinaryTarget</span> <span class="p">(</span><span class="s">'EDIBLE'</span><span class="p">,</span> <span class="s">'POISONOUS'</span><span class="p">)</span>
<span class="n">searchspace</span> <span class="o">=</span> <span class="n">ps</span><span class="p">.</span><span class="n">create_selectors</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">ignore</span><span class="o">=</span><span class="p">[</span><span class="s">'EDIBLE'</span><span class="p">])</span>
</code></pre></div></div>

<p>Then we have to create a Subgroup Discovery Task. In particular, we have to specify three parameters:</p>
<ul>
  <li>the <strong>number of rules</strong> we want to extract (result_set_size),</li>
  <li>the <strong>maximum size of the rule</strong> (depth),</li>
  <li>the <strong>quality measure</strong>. If you do not know what a quality measure is or which one best fits your task, you can take the Weighted Relative Accuracy (WRAcc), one of the most popular in the field.</li>
</ul>
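
<p>For intuition, the WRAcc of a subgroup can be computed from four counts only; a minimal sketch (the <code>wracc</code> function below is mine, not part of pysubgroup):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def wracc(n_total, n_pos, n_sub, n_sub_pos):
    """Coverage of the subgroup times its gain in positive rate over the whole dataset."""
    coverage = n_sub / n_total
    return coverage * (n_sub_pos / n_sub - n_pos / n_total)

# Balanced dataset of 100 rows with 50 positives;
# a subgroup covering 40 rows, 35 of them positive:
print(wracc(100, 50, 40, 35))   # about 0.15

# A subgroup covering exactly the 50 positives reaches the 0.25 upper bound:
print(wracc(100, 50, 50, 50))   # 0.25
</code></pre></div></div>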

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">task</span> <span class="o">=</span> <span class="n">ps</span><span class="p">.</span><span class="n">SubgroupDiscoveryTask</span> <span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">target</span><span class="p">,</span> <span class="n">searchspace</span><span class="p">,</span> 
            <span class="n">result_set_size</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">depth</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">qf</span><span class="o">=</span><span class="n">ps</span><span class="p">.</span><span class="n">WRAccQF</span><span class="p">())</span>
</code></pre></div></div>

<p>Next, we have to choose an algorithm to mine the rules. A good default is the popular <a href="https://en.wikipedia.org/wiki/Beam_search">beam search</a>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">results</span> <span class="o">=</span> <span class="n">ps</span><span class="p">.</span><span class="n">BeamSearch</span><span class="p">().</span><span class="n">execute</span><span class="p">(</span><span class="n">task</span><span class="p">)</span>
</code></pre></div></div>

<p>Finally, we print the rules we obtained:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">results</span><span class="p">.</span><span class="n">to_dataframe</span><span class="p">()</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0.19389014936350082: &lt;&lt;T: EDIBLE=POISONOUS; D: BRUISES=NO AND GILL-SPACING=CLOSE AND VEIL-TYPE=PARTIAL AND VEIL-COLOR=WHITE&gt;&gt;
0.19389014936350082: &lt;&lt;T: EDIBLE=POISONOUS; D: BRUISES=NO AND GILL-SPACING=CLOSE AND VEIL-COLOR=WHITE&gt;&gt;
0.19236944009552903: &lt;&lt;T: EDIBLE=POISONOUS; D: BRUISES=NO AND GILL-SPACING=CLOSE AND GILL-ATTACHEMENT=FREE AND VEIL-COLOR=WHITE&gt;&gt;
0.19236944009552903: &lt;&lt;T: EDIBLE=POISONOUS; D: BRUISES=NO AND GILL-SPACING=CLOSE AND GILL-ATTACHEMENT=FREE&gt;&gt;
0.19236944009552903: &lt;&lt;T: EDIBLE=POISONOUS; D: BRUISES=NO AND GILL-SPACING=CLOSE AND VEIL-TYPE=PARTIAL AND GILL-ATTACHEMENT=FREE&gt;&gt;
</code></pre></div></div>

<p>It is important to know that the WRAcc takes its values in the range [-0.25; 0.25] on a balanced dataset.
Therefore, a value of 0.1938 is very good: it means that this pattern is highly discriminative of poisonous mushrooms.</p>

<p>Let’s take a look at the first rule we obtained. We learn that if a mushroom has close gill-spacing, a partial white veil, and no bruises, then it is very likely poisonous.</p>

<p>This is perfectly interpretable for an expert. The following picture shows what the gill-spacing (the hymenium here) and the veil are.
<img src="http://www.toxinology.com/generic_static_files/images_generic/MD-fig1A-annulus-volva.gif" alt="Gill-spacing illustration" /></p>

<p>Let’s take an example with the famous <a href="https://en.wikipedia.org/wiki/Amanita_phalloides">Amanita phalloides</a>. As you can see in the picture below, this mushroom has no bruises (I guess? I am actually not an expert in mushrooms :) ), close gill-spacing, and a partial white veil. The rule tells you it is probably poisonous, and it is: Amanita phalloides is one of the most toxic mushrooms!</p>

<p><img src="https://upload.wikimedia.org/wikipedia/commons/9/99/Amanita_phalloides_1.JPG" alt="Amanita phalloides" /></p>

<p>That’s it, we have extracted useful <strong>knowledge</strong> from our dataset, and we can now use it to better understand our system.</p>

<p><strong>Note</strong>: The documentation of pysubgroup is lacking, but hopefully it will improve in the future.</p>

<p><strong>Note</strong>: There are also other ways to extract interpretable rules: for example, training a decision tree and extracting the path taken in the tree can give a pattern explaining the prediction. Clustering can also group similar elements, and finding frequent patterns within each cluster can yield interpretable rules.</p>
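
<p>To illustrate the decision-tree route, here is a sketch with scikit-learn on toy data (the rows below are made up for the example; this is not the real mushroom dataset):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data mimicking two features of the mushroom dataset
data = pd.DataFrame({
    "BRUISES":      ["NO", "NO", "YES", "YES", "NO", "YES"],
    "GILL-SPACING": ["CLOSE", "CLOSE", "BROAD", "BROAD", "CLOSE", "CLOSE"],
    "EDIBLE":       ["POISONOUS", "POISONOUS", "EDIBLE", "EDIBLE", "POISONOUS", "EDIBLE"],
})
X = pd.get_dummies(data[["BRUISES", "GILL-SPACING"]])
y = data["EDIBLE"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
# Each root-to-leaf path of the printed tree reads as an interpretable rule
print(export_text(tree, feature_names=list(X.columns)))
</code></pre></div></div>

<p>On this toy data the tree recovers a rule along the lines of “no bruises implies poisonous”, but only the greedy best split is explored at each node, unlike exhaustive subgroup discovery.</p>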

<p>The advantage of subgroup discovery over those methods is that it was designed to produce such rules, whereas for those other methods it is not the main purpose of the algorithm. Here, you have more control over what kind of rules you want to propose to the end user, particularly because you can choose the <em>Quality Measure</em> you want to use. In this formalism you can also use exhaustive algorithms to list all possible rules, which a decision tree, greedy by nature, cannot do.</p>]]></content><author><name>Romain Mathonat</name><email>romain.mathonat@gmail.com</email></author><summary type="html"><![CDATA[My last publication was on Subgroup Discovery for Sequences (you can access it freely here). However, in the Data Science community, many people are not aware of what “Subgroup Discovery” or “Pattern Mining” is. So let’s see, on a quick practical example, how to use it: knowing whether mushrooms are poisonous.]]></summary></entry><entry><title type="html">TDD in Python for Beginners</title><link href="http://vulgairedev.fr/2019/09/12/tdd.html" rel="alternate" type="text/html" title="TDD in Python for Beginners" /><published>2019-09-12T00:00:00+00:00</published><updated>2019-09-12T00:00:00+00:00</updated><id>http://vulgairedev.fr/2019/09/12/tdd</id><content type="html" xml:base="http://vulgairedev.fr/2019/09/12/tdd.html"><![CDATA[<h2 id="contexte">Context</h2>
<p>In real life, software lives on over time (unlike a class exercise, you do not throw the code away at the end of the day). Moreover, the specifications and the program’s inputs evolve. As soon as the code contains more than 2 or 3 functions, you have to watch out for “side effects”, i.e. make sure that modifying the program to meet a new specification does not break other features of the software.</p>

<h2 id="solution">Solution</h2>
<p>Test Driven Development (TDD) is a paradigm (“a way of doing things”) in which you write the tests for a piece of code before writing the code itself. Then, when you want to change the code, you only need to write new tests for the new cases and re-run the old ones. You minimize errors by forcing yourself to write short functions that meet a precise specification whose edge cases are tested as much as possible. In general this leads to better code: more maintainable, more concise, better tested.</p>

<p>The TDD cycle is as follows:</p>

<ol>
  <li>Write the test</li>
  <li>Run the tests. They must fail</li>
  <li>Write the code</li>
  <li>Run the tests. They must pass</li>
  <li>Refactor. Modifying the program may mean it needs to be “cleaned up” so that it stays easier to maintain in the future.</li>
</ol>

<h2 id="activité--fizzbuzz">Activity: FizzBuzz</h2>
<p>To write our tests, we will use <a href="https://docs.pytest.org/en/latest/getting-started.html">pytest</a>. The file tree is simple:<br />
├── TDD_example<br />
│   ├── <code class="language-plaintext highlighter-rouge">fizzbuzz.py</code><br />
│   └── <code class="language-plaintext highlighter-rouge">test_fizzbuzz.py</code></p>

<h2 id="cycle-numéro-1">Cycle number 1</h2>
<p>The program must behave as follows:<br />
<strong>Input</strong>: 1<br />
<strong>Output</strong>: 1</p>

<p>Run the TDD cycle: write the test, run the tests, write the code, re-run the tests.</p>

<p><strong>Solution</strong>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># test_fizzbuzz.py
</span><span class="kn">from</span> <span class="nn">fizzbuzz</span> <span class="kn">import</span> <span class="n">fizzbuzz</span>  
  
<span class="k">def</span> <span class="nf">test_process_number</span><span class="p">():</span>  
    <span class="k">assert</span> <span class="n">fizzbuzz</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span>  

</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># fizzbuzz.py
</span><span class="k">def</span> <span class="nf">fizzbuzz</span><span class="p">(</span><span class="n">number</span><span class="p">):</span>  
    <span class="k">if</span> <span class="n">number</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
        <span class="k">return</span> <span class="mi">1</span>  
</code></pre></div></div>

<p>Running the tests with pytest is simple; from within the directory:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pytest
</code></pre></div></div>

<p>We write the minimal code that meets the specification. We run the tests. If everything passes, we have completed one TDD cycle.</p>

<h2 id="cycle-numéro-2">Cycle number 2</h2>
<p><strong>Input</strong>: 1, 2 (1 or 2)<br />
<strong>Output</strong>: 1, 2</p>

<p><strong>Solution</strong>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># test_fizzbuzz.py
</span><span class="kn">from</span> <span class="nn">fizzbuzz</span> <span class="kn">import</span> <span class="n">fizzbuzz</span>  
  
<span class="k">def</span> <span class="nf">test_process_number</span><span class="p">():</span>  
    <span class="k">assert</span> <span class="n">fizzbuzz</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span>
    <span class="k">assert</span> <span class="n">fizzbuzz</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span> <span class="o">==</span> <span class="mi">2</span>  
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># fizzbuzz.py
</span><span class="k">def</span> <span class="nf">fizzbuzz</span><span class="p">(</span><span class="n">number</span><span class="p">):</span>  
    <span class="k">return</span> <span class="n">number</span>  
</code></pre></div></div>

<p>We modified fizzbuzz so that it satisfies the new specification, and we also verify, easily, that the previous specifications still hold. We have the guarantee that we did not break the tested program's behavior.</p>

<h2 id="cycle-numéro-3">Cycle 3</h2>
<p><strong>Input</strong>: 1, 2, 3<br />
<strong>Output</strong>: 1, 2, fizz</p>

<p><strong>Solution</strong>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># test_fizzbuzz.py
</span><span class="kn">from</span> <span class="nn">fizzbuzz</span> <span class="kn">import</span> <span class="n">fizzbuzz</span>  
  
<span class="k">def</span> <span class="nf">test_process_number</span><span class="p">():</span>  
    <span class="k">assert</span> <span class="n">fizzbuzz</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span>
    <span class="k">assert</span> <span class="n">fizzbuzz</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span> <span class="o">==</span> <span class="mi">2</span>
    <span class="k">assert</span> <span class="n">fizzbuzz</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span> <span class="o">==</span> <span class="s">'fizz'</span>  
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># fizzbuzz.py
</span><span class="k">def</span> <span class="nf">fizzbuzz</span><span class="p">(</span><span class="n">number</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">number</span> <span class="o">==</span> <span class="mi">3</span><span class="p">:</span>
        <span class="k">return</span> <span class="s">'fizz'</span>  
    <span class="k">return</span> <span class="n">number</span>  
</code></pre></div></div>
<p>We have a new case, which we handle easily with an if.</p>

<h2 id="cycle-numéro-4">Cycle 4</h2>

<p><strong>Input</strong>: 1, 2, 3, 5<br />
<strong>Output</strong>: 1, 2, fizz, buzz</p>

<p><strong>Solution</strong>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># test_fizzbuzz.py
</span><span class="kn">from</span> <span class="nn">fizzbuzz</span> <span class="kn">import</span> <span class="n">fizzbuzz</span>  
  
<span class="k">def</span> <span class="nf">test_process_number</span><span class="p">():</span>  
    <span class="k">assert</span> <span class="n">fizzbuzz</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span>
    <span class="k">assert</span> <span class="n">fizzbuzz</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span> <span class="o">==</span> <span class="mi">2</span>
    <span class="k">assert</span> <span class="n">fizzbuzz</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span> <span class="o">==</span> <span class="s">'fizz'</span>
    <span class="k">assert</span> <span class="n">fizzbuzz</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span> <span class="o">==</span> <span class="s">'buzz'</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># fizzbuzz.py
</span><span class="k">def</span> <span class="nf">fizzbuzz</span><span class="p">(</span><span class="n">number</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">number</span> <span class="o">==</span> <span class="mi">3</span><span class="p">:</span>
        <span class="k">return</span> <span class="s">'fizz'</span>
    <span class="k">if</span> <span class="n">number</span> <span class="o">==</span> <span class="mi">5</span><span class="p">:</span>
        <span class="k">return</span> <span class="s">'buzz'</span>  
    <span class="k">return</span> <span class="n">number</span>  
</code></pre></div></div>
<p>Yet another new case, handled with another if.</p>

<h2 id="cycle-numéro-5">Cycle 5</h2>
<p><strong>Input</strong>: 1, 2, 3, 5, 6, 10<br />
<strong>Output</strong>: 1, 2, fizz, buzz, fizz, buzz</p>

<p><strong>Solution</strong>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># test_fizzbuzz.py
</span><span class="kn">from</span> <span class="nn">fizzbuzz</span> <span class="kn">import</span> <span class="n">fizzbuzz</span>  
  
<span class="k">def</span> <span class="nf">test_process_number</span><span class="p">():</span>  
    <span class="k">assert</span> <span class="n">fizzbuzz</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span>
    <span class="k">assert</span> <span class="n">fizzbuzz</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span> <span class="o">==</span> <span class="mi">2</span>
    <span class="k">assert</span> <span class="n">fizzbuzz</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span> <span class="o">==</span> <span class="s">'fizz'</span>
    <span class="k">assert</span> <span class="n">fizzbuzz</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span> <span class="o">==</span> <span class="s">'buzz'</span>
    <span class="k">assert</span> <span class="n">fizzbuzz</span><span class="p">(</span><span class="mi">6</span><span class="p">)</span> <span class="o">==</span> <span class="s">'fizz'</span>
    <span class="k">assert</span> <span class="n">fizzbuzz</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span> <span class="o">==</span> <span class="s">'buzz'</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># fizzbuzz.py
</span><span class="k">def</span> <span class="nf">fizzbuzz</span><span class="p">(</span><span class="n">number</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">number</span> <span class="o">%</span> <span class="mi">3</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">return</span> <span class="s">'fizz'</span>
    <span class="k">if</span> <span class="n">number</span> <span class="o">%</span> <span class="mi">5</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">return</span> <span class="s">'buzz'</span>  
    <span class="k">return</span> <span class="n">number</span>  
</code></pre></div></div>
<p>This time we realize that it is the multiples of 3 that must return “fizz” and the multiples of 5 that must return “buzz”.</p>

<h2 id="cycle-numéro-6">Cycle 6</h2>

<p><strong>Input</strong>: 1, 2, 3, 5, 6, 10, 15<br />
<strong>Output</strong>: 1, 2, fizz, buzz, fizz, buzz, fizzbuzz</p>

<p><strong>Solution</strong>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># test_fizzbuzz.py
</span><span class="kn">from</span> <span class="nn">fizzbuzz</span> <span class="kn">import</span> <span class="n">fizzbuzz</span>  
  
<span class="k">def</span> <span class="nf">test_process_number</span><span class="p">():</span>  
    <span class="k">assert</span> <span class="n">fizzbuzz</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span>
    <span class="k">assert</span> <span class="n">fizzbuzz</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span> <span class="o">==</span> <span class="mi">2</span>
    <span class="k">assert</span> <span class="n">fizzbuzz</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span> <span class="o">==</span> <span class="s">'fizz'</span>
    <span class="k">assert</span> <span class="n">fizzbuzz</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span> <span class="o">==</span> <span class="s">'buzz'</span>
    <span class="k">assert</span> <span class="n">fizzbuzz</span><span class="p">(</span><span class="mi">6</span><span class="p">)</span> <span class="o">==</span> <span class="s">'fizz'</span>
    <span class="k">assert</span> <span class="n">fizzbuzz</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span> <span class="o">==</span> <span class="s">'buzz'</span>
    <span class="k">assert</span> <span class="n">fizzbuzz</span><span class="p">(</span><span class="mi">15</span><span class="p">)</span> <span class="o">==</span> <span class="s">'fizzbuzz'</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># fizzbuzz.py
</span><span class="k">def</span> <span class="nf">fizzbuzz</span><span class="p">(</span><span class="n">number</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">number</span> <span class="o">%</span> <span class="mi">3</span> <span class="o">==</span> <span class="mi">0</span> <span class="ow">and</span> <span class="n">number</span> <span class="o">%</span> <span class="mi">5</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">return</span> <span class="s">'fizzbuzz'</span>  
    <span class="k">if</span> <span class="n">number</span> <span class="o">%</span> <span class="mi">3</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">return</span> <span class="s">'fizz'</span>
    <span class="k">if</span> <span class="n">number</span> <span class="o">%</span> <span class="mi">5</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">return</span> <span class="s">'buzz'</span>
    <span class="k">return</span> <span class="n">number</span>  
</code></pre></div></div>
<p>We have yet another new case: numbers that are multiples of both 3 and 5 must return ‘fizzbuzz’. We handle it in this new TDD cycle.</p>
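<p>As an aside (and not part of the original article's cycles), the growing list of asserts can also be written with pytest's <code>parametrize</code>, which reports each case separately. A minimal sketch, with the cycle-6 <code>fizzbuzz</code> inlined so the snippet stands alone:</p>

```python
import pytest

# fizzbuzz from cycle 6, inlined here so the snippet is self-contained
def fizzbuzz(number):
    if number % 3 == 0 and number % 5 == 0:
        return 'fizzbuzz'
    if number % 3 == 0:
        return 'fizz'
    if number % 5 == 0:
        return 'buzz'
    return number

# each (input, expected) pair becomes its own test case in pytest's report
@pytest.mark.parametrize("number, expected", [
    (1, 1), (2, 2), (3, 'fizz'), (5, 'buzz'),
    (6, 'fizz'), (10, 'buzz'), (15, 'fizzbuzz'),
])
def test_process_number(number, expected):
    assert fizzbuzz(number) == expected
```

<p>The behavior is the same as the single test function with many asserts; the difference is that one failing pair no longer hides the pairs after it.</p>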

<p>With the tests passing, we can refactor the code into something more elegant. We also add a docstring explaining what the function does, which is useful when coming back to the code months or years later, or to quickly bring up to speed another developer working on the project.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># fizzbuzz.py
</span><span class="k">def</span> <span class="nf">fizzbuzz</span><span class="p">(</span><span class="n">number</span><span class="p">):</span>
    <span class="s">'''
    :param number: the number to convert
    :return: 'fizz' if number is a multiple of 3, 'buzz' if a multiple of 5, 'fizzbuzz' if a multiple of both, or number in the default case.
    '''</span>
    <span class="n">multiple_3</span> <span class="o">=</span> <span class="n">number</span> <span class="o">%</span> <span class="mi">3</span> <span class="o">==</span> <span class="mi">0</span>
    <span class="n">multiple_5</span> <span class="o">=</span> <span class="n">number</span> <span class="o">%</span> <span class="mi">5</span> <span class="o">==</span> <span class="mi">0</span>
    
    <span class="k">if</span> <span class="n">multiple_3</span> <span class="ow">and</span> <span class="n">multiple_5</span><span class="p">:</span>
        <span class="k">return</span> <span class="s">'fizzbuzz'</span>  
    <span class="k">elif</span> <span class="n">multiple_3</span><span class="p">:</span>
        <span class="k">return</span> <span class="s">'fizz'</span>
    <span class="k">elif</span> <span class="n">multiple_5</span><span class="p">:</span>
        <span class="k">return</span> <span class="s">'buzz'</span>
    <span class="k">return</span> <span class="n">number</span>  
</code></pre></div></div>]]></content><author><name>Romain Mathonat</name><email>romain.mathonat@gmail.com</email></author><summary type="html"><![CDATA[Context In real life, software applications last over time (unlike a lab exercise, we do not throw the code away at the end of the day). Moreover, the specifications and the inputs of the program evolve. As soon as the code contains more than 2 or 3 functions, we have to watch out for “side effects”, i.e. make sure that modifying the program to meet a new specification does not destroy other features of the software.]]></summary></entry><entry><title type="html">SAX: Piecewise Aggregate Approximation</title><link href="http://vulgairedev.fr/2019/03/21/paa.html" rel="alternate" type="text/html" title="SAX: Piecewise Aggregate Approximation" /><published>2019-03-21T00:00:00+00:00</published><updated>2019-03-21T00:00:00+00:00</updated><id>http://vulgairedev.fr/2019/03/21/paa</id><content type="html" xml:base="http://vulgairedev.fr/2019/03/21/paa.html"><![CDATA[<p><strong>Problem</strong>: We have a series of n numbers that we want to divide into w slots. We want to compute the mean of each slot; how do we proceed when n is not divisible by w? This is called a Piecewise Aggregate Approximation (PAA).</p>

<p>This question came up when I read the <a href="https://cs.gmu.edu/~jessica/SAX_DAMI_preprint.pdf">SAX algorithm</a>, which is used to convert a time series into a sequence of symbols. The trick is briefly explained in the paper, but the implementation requires a bit of thinking. The following figure is taken from the original paper, <a href="https://cs.gmu.edu/~jessica/SAX_DAMI_preprint.pdf">Experiencing SAX: a Novel Symbolic Representation of Time Series</a>.</p>

<p><img src="/assets/images/sax.png" alt="" style="display:block; margin-left:auto; margin-right:auto" /></p>

<p>The natural way would be to give each slot a size of \(\frac{n}{w} = n // w + \frac{n \% w}{w}\) (where n//w is the floor division).
We start from index 0 and add n//w points, then a fraction of the next point corresponding to (n%w)/w. For the second slot, we take the remaining fraction of that point, add n//w points, then a fraction of the next point, so that the slot again has size n/w. We keep going until we reach the end of the series. <br />
The issue with this strategy is that it is quite difficult and inelegant to code.</p>

<p>There is a more elegant way. If we multiply n by w, we can consider that each point is repeated w times. What is the point of doing this? We saw that the fraction of a point we need to add to the current slot is always a quantity divided by w (namely (n%w)/w). By repeating each point w times, we only ever deal with a whole number of points.</p>

<p><img src="/assets/images/paa_transform.png" alt="" style="display:block; margin-left:auto; margin-right:auto" /></p>

<p>Now a slot does not have a size of n/w, but n. We sum the elements of the slot, then divide by n at the end: this indeed gives the mean of the slot. For example, the second slot in the picture above has a mean of (s[i] is the ith element of the series):<br />
\(\frac{\frac{2}{3}s[2] +\frac{2}{3}s[3]}{\frac{4}{3}}\)</p>

<p>Considering the new representation, the mean is:<br />
 \(\frac{2s[2] +2s[3]}{4}\)</p>

<p>which is the same.</p>
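<p>Before the general code, here is a toy example of the repetition trick (the numbers are illustrative, not from the paper): with n = 3 points and w = 2 slots, each slot should cover 1.5 points, but after repeating each point w times every slot is exactly n whole entries.</p>

```python
s = [10, 20, 30]   # n = 3 points, w = 2 slots of fractional size 1.5
w = 2
n = len(s)

# repeat each point w times: [10, 10, 20, 20, 30, 30]
repeated = [x for x in s for _ in range(w)]

# each slot is now exactly n consecutive repeated entries
slots = [repeated[i * n:(i + 1) * n] for i in range(w)]
means = [sum(slot) / n for slot in slots]
print(means)  # [(10 + 10 + 20)/3, (20 + 30 + 30)/3] = [13.33..., 26.66...]
```

<p>The direct fractional computation gives the same result: the first slot is \((10 + \frac{1}{2} \cdot 20) / 1.5 = 13.33...\), the second \((\frac{1}{2} \cdot 20 + 30) / 1.5 = 26.66...\)</p>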

<p>Now the code:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">paa</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">w</span><span class="p">):</span>
    <span class="n">res</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="n">w</span>
    <span class="n">n</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">s</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span>  <span class="nb">range</span><span class="p">(</span><span class="n">w</span> <span class="o">*</span> <span class="n">n</span><span class="p">):</span>
        <span class="n">idx</span> <span class="o">=</span> <span class="n">i</span> <span class="o">//</span> <span class="n">n</span>
        <span class="n">pos</span> <span class="o">=</span> <span class="n">i</span> <span class="o">//</span> <span class="n">w</span>
        <span class="n">res</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span> <span class="o">+=</span> <span class="n">s</span><span class="p">[</span><span class="n">pos</span><span class="p">]</span>
    <span class="n">res</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span> <span class="o">/</span> <span class="n">n</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">res</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">res</span>
    
<span class="k">print</span><span class="p">(</span><span class="n">paa</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="o">-</span><span class="mi">2</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="mi">4</span><span class="p">))</span>
<span class="c1"># &gt;&gt;&gt; [1.3333333333333333, 2.4444444444444446, 4.888888888888889, -0.6666666666666666]
</span></code></pre></div></div>
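<p>For what it's worth, the same idea can be written with NumPy (assuming NumPy is available; this is a sketch, not part of the original code): <code>np.repeat</code> builds the repeated series, and reshaping the w*n entries into a (w, n) array makes each row one slot.</p>

```python
import numpy as np

def paa_numpy(s, w):
    # repeat each point w times, cut the w*n entries into w rows of n, average each row
    s = np.asarray(s, dtype=float)
    n = len(s)
    return np.repeat(s, w).reshape(w, n).mean(axis=1)

# same inputs as the loop version above, same result
print(paa_numpy([1, 2, 0, 4, 3, 5, 6, -2, -1], 4))
```

<p>Flat index i of the repeated array maps to original element i//w and to row i//n, exactly the idx/pos bookkeeping of the loop version.</p>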
<p>The following plot shows what the PAA looks like:</p>

<p><img src="/assets/images/paa_plot.png" alt="" style="display:block; margin-left:auto; margin-right:auto" /></p>

<p>NB: this method has the drawback of greatly increasing the number of iterations (w * n instead of n). For a very long time series, it may be too slow. In that case, you can simply give each slot a size of
n // w. If n is not divisible by w, you will get w + 1 slots, the last one containing fewer points (you need to be aware of that).</p>]]></content><author><name>Romain Mathonat</name><email>romain.mathonat@gmail.com</email></author><summary type="html"><![CDATA[Problem: We have a series of n numbers that we want to divide into w slots. We want to compute the mean of each slot; how do we proceed when n is not divisible by w? This is called a Piecewise Aggregate Approximation (PAA).]]></summary></entry></feed>