DISTRIBUTED DEEP LEARNING MODELS: USING TENSORFLOW AND
PYTORCH ON NVIDIA GPUs AND CLUSTER OF RASPBERRY PIs

By

Jagadish Kumar Ranbirsingh, B.TECH.
Biju Patnaik University of Technology

A Thesis Submitted in Partial Fulfillment of
the Requirements for the Degree of Master of Science in Computer Science
to the office of Graduate and Extended Studies of
East Stroudsburg University of Pennsylvania

May 10, 2019

SIGNATURE/APPROVAL PAGE
The signed approval page for this thesis was intentionally removed from the online copy by an
authorized administrator at Kemp Library.
The final approved signature page for this thesis is on file with the Office of Graduate and
Extended Studies. Please contact Theses@esu.edu with any questions.

ABSTRACT
A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of
Master of Science in Computer Science to the office of Graduate and Extended
Studies of East Stroudsburg University of Pennsylvania.
Student’s Name: Jagadish Kumar Ranbirsingh
Title: DISTRIBUTED DEEP LEARNING MODELS: USING TENSORFLOW AND
PYTORCH ON NVIDIA GPUs AND CLUSTER OF RASPBERRY PIs
Date of Graduation: May 10, 2019
Thesis Chair: Haklin Kimm, PhD
Thesis Member: Eun-Joo Lee, PhD
Thesis Member: Minhaz Chowdhury, PhD
Abstract
This thesis focuses on distributed deep learning approaches to Human Activity
Recognition (HAR) with a Recurrent Neural Network (RNN) Long Short-Term Memory
(LSTM) model trained on data from the University of California, Irvine (UCI) machine
learning repository. The work includes developing the LSTM residual bidirectional
architecture in the Python 3 programming language over the distributed TensorFlow and
PyTorch frameworks on top of two testbed systems: the first is a Raspberry Pi cluster
built from 16 Raspberry Pis connected by a parameter server architecture; the second is
an NVIDIA GPU cluster equipped with three GPUs, a Tesla K40c, a Quadro P5000, and a
Quadro K620. We compare and observe the performance of our deep learning algorithms
in terms of execution time and prediction accuracy while varying the number of deep
layers and hidden neurons in the neural networks. Our first comparison is based on using
TensorFlow and PyTorch over the NVIDIA Maximus distributed multicore architecture.
The second comparison is between execution on the Raspberry Pi cluster and on an
octa-core Intel Xeon CPU. In this research we show that the implementations of the
distributed neural network over the GPU cluster outperform both the Raspberry Pi cluster
and the multicore system.

ACKNOWLEDGMENTS

I would like to express my deep gratitude to my advisor, Dr. Haklin Kimm, for his
unwavering support and guidance during my research and thesis study. Without his
invaluable ideas, encouragement, suggestions, and corrections, I could not have achieved
this outcome. I really appreciate his mentorship throughout this journey. I am also thankful to the
members of my thesis examining committee, Dr. Eun-Joo Lee and Dr. Minhaz
Chowdhury.
I would like to express my sincere gratitude to the NVIDIA Corporation for the
donation of the Tesla K40c GPU, and to the ESU Computer Science department for the other GPUs
and Raspberry Pis used in this research project.
Most importantly, none of this would have been possible without the patience and
sacrifice of my family. My wife, to whom this thesis is dedicated, has been a
constant source of courage to push against all odds during these two years. I would also like to
express my hearty gratitude to my parents.

TABLE OF CONTENTS
LIST OF TABLES………………………………………………………………………vii
LIST OF FIGURES……………………………………………………………………..viii
Chapter
I. INTRODUCTION……………………………………………………………………...1
1.1 Machine Learning with Big Data……………………………………………………...1
1.2 Deep Learning……………………………………………………………………........2
1.3 Deep Learning Using GPU……………………………………………………………3
1.4 Neural Nets……………………………………………………………………………3
1.4.1 Recurrent Neural Nets……………………………………………………………..4
1.5 Motivation……………………………………………………………………………..4
1.6 Thesis Contribution…………………………………………………………………....5
1.7 Outline of the Thesis…………………………………………………………………..6
II. PREVIOUS STUDIES…………………………………………………………………7
2.1 Big Data…………………………………………………………………………….....7
2.2 TensorFlow……………………………………………………………………………8
2.2.1 Architecture…………………………………………………………………….....9
2.3 Distributed TensorFlow……………………………………………………………...10
2.4 PyTorch……………………………………………………………………………....15
2.5 MXNet……………………………………………………………………………….16
III. RELATED WORKS………………………………………………………………....18
3.1 Distributed GraphLab Framework……………………………………………….......18
3.2 Parameter Server Framework……………………………………………………......19
3.2.1 Distributed Synchronous Stochastic Gradient Descent………………………….21
3.3 Deep Gradient Compression…………………………………………………………24
IV. LSTM FOR HUMAN ACTIVITY RECOGNITION……………………………….27
4.1 Human Activity Recognition………………………………………………………...27
4.1.1 Surveillance System……………………………………………………………...28
4.1.2 Healthcare………………………………………………………………………..28
4.1.3 Human Computer Interaction……………………………………………………29
4.1.4 HAR Sensing Technologies……………………………………………………...29
4.2 Data (UCI Repository)……………………………………………………………….30
4.2.1 Dataset Information……………………………………………………………...30
4.2.2 Attribute Information…………………………………………………………….31
4.2.3 Feature Notes…………………………………………………………………….31
4.3 LSTM…………………………………………………………………………….......35
4.3.1 Why LSTM……………………………………………………………………....35


4.3.1.1 CNN…………………………………………………………………………....35
4.3.1.2 Back Propagation ……………………………………………………………...37
4.3.1.3 RNN……………………………………………………………………………42
4.3.1.4 LSTM…………………………………………………………………………..46
4.3.1.5 Distributed LSTM……………………………………………………………...47
4.3.1.5.1 Synchronous All – Reduce SGD…………………………………………...48
4.3.2 Baseline LSTM…………………………………………………………………….50
4.3.3 Bidirectional LSTM……………………………………………………..................54
4.3.4 Residual LSTM…………………………………………………………………….55
4.3.5 Deep Residual Bidirectional LSTM…………………………………......................58
V. TEST BED SETUP…………………………………………………………………..64
5.1 NVIDIA GPU Test Bed Setup……………………………………………………….64
5.2 Cluster of Raspberry PIs Setup………………………………………………………72
5.3 Simulation using Raspberry PIs Cluster……………………………………………..77
5.4 Notes on TensorFlow Setup………………………………………………………….82
5.5 Notes on using PyTorch Setup……………………………………………………….85
VI. IMPLEMENTATION………………………………………………………………..87
6.1 Best Learning Rate…………………………………………………………………...87
6.2 CPU Execution Time between Layers…………………………………………….....89
6.3 GPU Execution Time between Layers…………………………………………….....91
6.4 Bidirectional Vs Non-Bidirectional Execution between Layers……………………..99
6.5 Stack Bidirectional Vs Stack Non-Bidirectional Execution between Layers………101
6.6 Best Accuracy between all Layers………………………………………………….103
6.7 Between Deep Residual Bidirectional between 3 x3 and 4 x4……………………..105
6.8 Between Deep Layer vs Prediction Accuracy vs Exe Time in CPU……………….106
6.9 Between Deep Layer vs Prediction Accuracy vs Exe Time in GPU…………….....108
6.10 Deep Layer CPU Execution……………………………………………………….110
6.11 Lower GPU vs Higher CPU……………………………………………………….111
6.12 4 x 4 CPU vs GPU Layers………………………………………………………...113
6.13 Bidirectional Lower vs Stack Higher Layers……………………………………...115
6.14 Stack vs Hidden layer on Execution Time and Prediction Accuracy……………..117
6.15 PyTorch vs TensorFlow Efficiency Comparison………………………………….120
6.16 Raspberry PI Cluster vs Intel Xeon CPU Efficiency Comparison………………..122
VII. CONCLUSION……………………………………………………………………124
7.1 Summary……………………………………………………………………………124
7.2 Future Work………………………………………………………………………...125
APPENDIX A – TEST BED ARCHITECTURE………………………………………127
APPENDIX B – SOURCE CODE……………………………………………………..143
APPENDIX C – TENSORFLOW SET UP…………………………………………….209
APPENDIX D – PYTORCH SETUP…………………………………………………..212
APPENDIX E – RASPBERRY PI CLUSTER………………………………………....214
REFERENCES…………………………………………………………………………217


LIST OF TABLES
Table
1. Dataset Feature Parameters……………………………………………………………32
2. Hardware configuration of CPU………………………………………………………65
3. Hardware configuration of GPUs……………………………………………………..65
4. Hardware configuration of Raspberry Pi 3 Model B+………………………………...73
5. Raspberry Pi Cluster Monte Carlo Simulation……………………………………….81
6. Best Learning Rate…………………………………………………………………….87
7. Execution rate between Layers in CPU……………………………………………….89
8. Execution rate between Layers in GPU……………………………………………….91
9. Bidirectional vs Non-bidirectional layers execution time…………………………….99
10. Stack Bidirectional vs Stack Non-bidirectional execution time……………………101
11. Best Accuracy in 3 deep layers……………………………………………………..103
12. Deep Residual 3 x3 vs 4 x 4 layers…………………………………………………105
13. Execution matrix of all layers by CPU……………………………………………..106
14. Execution matrix of all layers by GPU……………………………………………..108
15. 4 x 4 Layers deep CPU execution matrix…………………………………………..110
16. 3 x 3 Layer GPU vs 4 x4 Layer CPU execution matrix……………………………111
17. 4 x 4 Layers CPU vs GPU Execution………………………………………………113
18. 2 x 2 Bidirectional Stack Layer vs 3 x 3 Stack Layer………………………………115
19. 2 x 2 stacked hidden layers vs 4 x 4 stacked hidden layers………………………...117
20. Efficiency between PyTorch and TensorFlow……………………………………..120
21. Efficiency between Raspberry Pi Cluster and Intel Xeon CPU……………………122


LIST OF FIGURES
Figure
1. TensorFlow General Architecture……………………………………………………..9
2. TensorFlow Master Worker Model…………………………………………………..11
3. Distributed Master workflow…………………………………………………………12
4. NVIDIA MultiGPU NCCL…………………………………………………………...13
5. Parameter Server Framework………………………………………………………...21
6. Distributed SGD………………………………………………………………………21
7. Deep Gradient Compression………………………………………………………….24
8. Dataset File Structure…………………………………………………………………34
9. Basic Feed-Forward and Recurrent cell………………………………………………35
10. Two Connected Neurons with weights………………………………………………40
11. Back Propagation Rule………………………………………………………………40
12. Convolution Neural Network………………………………………………………..41
13. RNN Sequential Data Learning Approach…………………………………………..43
14. Simple RNN Structure……………………………………………………………….44
15. LSTM Forget Gate…………………………………………………………………...50
16. LSTM Input Gate…………………………………………………………………….51
17. LSTM Processing Data………………………………………………………………51
18. LSTM Output Gate…………………………………………………………………..52
19. The unfolded structure of one-layer baseline LSTM………………………………...53
20. The structure of single layer bidirectional LSTM……………………………………54
21. The structure of single layer residual LSTM………………………………………...57
22. The structure of 2 x 2 residual bidirectional LSTM…………………………………60
23. The residual bidirectional LSTM parameters………………………………………..61
24. NVIDIA Driver Version……………………………………………………………..67
25. CUDA Toolkit Version………………………………………………………………68
26. GPU Memory Array…………………………………………………………………71
27. Distributed TensorFlow API………………………………………………………...72
28. Raspberry PIs NFS connection………………………………………………………76
29. NFS Status…………………………………………………………………………...76
30. TensorFlow Cluster API……………………………………………………………..79
31. TensorFlow Device API……………………………………………………………..79
32. TensorFlow Session API…………………………………………………………….80
33. TensorFlow Server API……………………………………………………………...80
34. Pi Cluster Execution Graph………………………………………………………….82
35. TensorFlow GPU Growth API………………………………………………………82
36. GPU StreamExecutor………………………………………………………………..83
37. GPU Device Selection……………………………………………………………….83
38. Distributed TF Multi GPU…………………………………………………………..84
39. PyTorch Distributed API……………………………………………………………85
40. PyTorch Memory Shuffle……………………………………………………………85

41. Best Learning Rate…………………………………………………………………...88
42. Big Machine CPU Details…………………………………………………………....89
43. Bubble Chart of CPU Execution……………………………………………………..90
44. Column Graph of CPU Execution between Layers………………………………….91
45. Bubble Chart of GPU Execution……………………………………………………..92
46. Column Graph of GPU Execution between Layers………………………………….93
47. 2 x 2 Layers GPU Utilization Snapshot……………………………………………...94
48. 3 x 3 Layers GPU Utilization Snapshot……………………………………………...96
49. 4 x 4 Layers GPU Utilization Snapshot……………………………………………...97
50. 3 x 3 Layers CPU Utilization Snapshot……………………………………………...98
51. Column Graph of Execution time between bidirectional and non-bidirectional…...100
52. Deep Bidirectional vs Deep Non-bidirectional Execution time……………………102
53. Best Accuracy among all types of 3 stacked layers………………………………...104
54. Column graph of 3 x 3 vs 4 x 4 Deep Residual Layers…………………………….105
55. CPU Execution Graph for all Layers……………………………………………….107
56. GPU Execution Graph for all Layers……………………………………………….109
57. Column Graph of 4 x 4 deep Layers CPU execution……………………………….111
58. Column Graph of 3 x 3 GPU vs 4 x 4 CPU Execution Result……………………...113
59. Graph of 4 x 4 deep layers CPU vs GPU Execution………………………………..114
60. Graph of 2 x 2 Bidirectional Stack Layer vs 3 x 3 Stack Layer……………………116
61. Execution time Graph of 2 x2 vs 4x 4 stacked layers………………………………117
62. Execution time graph with stack layers vs hidden layers…………………………..119
63. Execution graph between TensorFlow and PyTorch……………………………….121
64. Execution graph between Single CPU vs Pi Cluster……………………………….123
65. NVIDIA GPU Cards………………………………………………………………..128
66. NVIDIA Driver Repository………………………………………………………...129
67. Graphics Display …………………………………………………………………...130
68. GDM Session……………………………………………………………………….132
69 NVIDIA Driver Successful Installation Snapshot…………………………………..133
70. Ubuntu Driver Display……………………………………………………………..134
71. CUDA Toolkit……………………………………………………………………...135
72. An L2-regularized version of the cost function used in SGD of RNN……………..140
73. TF Project Screen…………………………………………………………………...211
74. PyTorch Project Screen……………………………………………………………..213


CHAPTER I
INTRODUCTION
1.1 Machine Learning With Big Data
More than 2.5 quintillion bytes of data are created each day. The prevalence of data will
only increase, so we need to learn how to deal with such large data. Storing this data is
one thing, but what about processing it and developing machine learning algorithms to
work with it? Solving complex computational problems in a short amount of time, as well
as dealing with large-sized data sets and massive amounts of continuously growing data,
are some challenges that are being addressed by parallel processing algorithms. Data
centers deployed with high-end GPUs enable computational storage and network
processing power to support such highly demanding workloads. Access to the thousands of
cores in each GPU, together with high-capacity networking and high-IOPS (Input/Output Operations
Per Second) storage, provides an infrastructure well suited for HPC and Big Data
applications. But this alone will not be enough in the future: this line of research should
focus on developing new Machine Learning (ML) models and on adapting (scaling up)
existing models in order to handle larger scale datasets.


1.2 Deep Learning
Deep Learning [1] is a sub-field of machine learning concerned with algorithms inspired
by the structure and function of the brain, called artificial neural networks. It uses non-linear
processing units arranged in multiple layers for feature transformation and extraction, and it
organizes concepts in multiple hierarchical fashions corresponding to various levels of
abstraction. As Jeff Dean, a scientist at Google AI Brain, puts it: “When you hear the
term deep learning, just think of a large deep neural net. Deep refers to the number of
layers typically and so this kind of the popular term that’s been adopted in the press. I
think of them as deep neural networks generally.” Modern neural network architectures
trained on large datasets can obtain impressive performance across a wide variety of
domains, from speech and image recognition to natural language processing and industry-focused
applications such as fraud detection and recommendation systems.
Deep Learning (DL) has become a true enabler of AI services. In fact, it is the key driver
behind today’s entire field of AI with its real-life practical applications. DL’s business
utilization and its ability to support business objectives have enabled AI services to take a
hot spot at the company strategic table. From life and health sciences, through
engineering and financial modeling, to natural language processing and image
recognition, the employment of DL is growing exponentially year by year. This growth in
applications of AI services is primarily due to the infrastructure behind the curtain and its
utilization of parallel computing with increasingly more advanced GPU technologies to
enable such progress.
As the computational power of machines grew exponentially, the need arose to move
to higher-computation CPUs; when CPUs could no longer provide adequate solutions,
technology leaped from the CPU to the GPU. For DL to take full advantage of the GPU
hardware architecture and acceleration, there needs to be an “easy” way to allow
algorithms to leverage, scale up, and consume the underlying infrastructure. DL frameworks
represent and combine such sets of tools, interfaces, and libraries, which allow data
scientists, engineers, and developers to build, deploy, and manage their training models
and networks. They are the building blocks of modern DL deployments. Today, the most
popular DL frameworks include, but are not limited to, TensorFlow, Keras, Caffe2,
PyTorch, Theano, Chainer, CNTK, and MXNet. Each of these frameworks is built in a
different manner and serves different purposes.

1.3 Deep Learning Using GPU
Deep Learning Neural Networks are becoming continuously more complex. The number
of layers and neurons in a Neural Network is growing significantly, which lowers
productivity and increases costs. DL deployments leveraging GPUs [2] drastically reduce
the size of the hardware deployments, increase scalability, dramatically reduce the
training and ROI times and lower the overall deployment cost. The new GPU based
systems with access to the latest NVIDIA GPU architectures with PCIe interface or with
NVLink interconnections can utilize the access to a massive amount of DL computing
power by using GPU clusters.

1.4 Neural Nets
There are three classes of artificial neural networks in general. They are:
Multilayer Perceptrons (MLPs)
Convolutional Neural Networks (CNNs)
Recurrent Neural Networks (RNNs)
In this project we have extensively used RNNs [3] because of their internal memory.
RNNs are able to remember important things about the input they receive, which enables
them to be very precise in predicting future values.


1.4.1 Recurrent Neural Nets
RNNs are the state-of-the-art algorithm for sequential data and are used by Apple's Siri and
Google's Voice Search. This is because the RNN is the first algorithm that remembers its input,
thanks to an internal memory, which makes it perfectly suited for machine learning
problems that involve sequential data. It is one of the algorithms behind the
amazing achievements of Deep Learning [2] in the past few years. In an RNN, the
information cycles through a loop: when the network makes a decision, it takes into consideration
the current input as well as what it has learned from previous inputs. Therefore an RNN
has two inputs, the present and the recent past. A plain RNN has only a short-term memory.
In combination with an LSTM [4], RNNs also gain a long-term memory, which is very
powerful and is used in the computation of complex datasets.

1.5 Motivation
Machine learning on a single CPU is old; the term was coined by Arthur L. Samuel in
1959 [5] in work published in the IBM Journal of Research and Development.
Then came the GPU. The GeForce 256 was marketed as the "world's first GPU", or Graphics
Processing Unit, a term coined by NVIDIA at that time for "a single-chip processor with
integrated lighting, triangle setup/clipping, and rendering engines that is capable of
processing a minimum of 10 million polygons per second." The GeForce 256 was the
original release in NVIDIA's GeForce product line, announced on August 31, 1999 and
released on October 11, 1999 [6]. Machine learning over GPU-based data
warehouses is newer and still evolving. Nowadays the rate of data generation is very high
because of social networking sites such as Facebook, Twitter, WhatsApp, WeChat,
Instagram, and many more. With advances in technology, sensor networks, IoT
devices, and automated systems generate much more data every second. So in the near future,
data warehouses will be established in multiple geographical areas across the globe.
Unfortunately, current deep learning methodologies that are based on a single location or a
single dataset will not suffice. Distributed optimization and inference is becoming a
prerequisite for solving large scale deep learning problems. At scale, no single machine
can solve these problems efficiently, due to the growth of data and the resulting model
complexity, often manifesting itself in an increased number of parameters [7].

1.6 Thesis Contribution
Inspired by "Scaling Distributed Machine Learning with the Parameter Server" [8], we
propose a cluster-based platform designed around the parameter server architecture.
The thesis focuses on distributed deep learning models for Human Activity
Recognition. The deep learning LSTM model iterates over the UCI dataset using the
distributed TensorFlow and PyTorch programming frameworks. This includes writing the
LSTM residual bidirectional architecture in the Python 3 programming language with the
TensorFlow and PyTorch APIs, both of which support distributed architectures.
Following that, the program is verified on the distributed platforms. To meet the
distributed hardware demand, two platforms were created. The first is a Raspberry Pi
cluster of 16 nodes, built from 16 Raspberry Pi 3 Model B+ boards clustered
together using the parameter server architecture, each having 1 GB of RAM and 32 GB of
flash storage. The second is the NVIDIA GPU cluster with 3 GPUs,
a Tesla K40c, a Quadro P5000, and a Quadro K620. It is built in the NVIDIA Maximus
configuration on top of an octa-core Intel Xeon CPU with 32 GB of RAM, a 2 TB SSD as
primary storage, and a 10 TB HDD as secondary storage. We compare and observe the
performance in terms of execution speed and efficiency of the deep learning iterations while
varying the number of deep layers and hidden neurons on the GPUs and CPUs. The first
comparison is based on using TensorFlow and PyTorch over the NVIDIA GPUs' parallel and
distributed multicore architecture. The second comparison examines the execution
speed and efficiency of the CPUs of the Pi cluster against the Intel Xeon CPU. The research
focuses on energy-efficient deep learning computing, which lies at the intersection of
deep learning and distributed computing.


1.7 Outline of the Thesis
The remainder of the thesis is organized as follows.
Chapter 2 gives background information about work on distributed deep
learning models. It introduces the distributed deep learning APIs of TensorFlow and
PyTorch.
Chapter 3 presents related research work on this topic. It introduces similar
problems and the previous research on them.
Chapter 4 discusses why LSTM is used, different LSTM architectures, and our
proposed LSTM residual bidirectional layer.
Chapter 5 shows the preparation of the test beds for this research. It describes the related work
carried out during the hardware cluster development.
Chapter 6 implements the deep learning models on the distributed platforms and compares the
computational power of the two APIs on the GPU cluster, as well as the computational power
of the CPU cluster against a standalone CPU machine while performing the iterations. It presents
all of the implementations with execution time and prediction accuracy while varying the dense
layers along with different numbers of hidden nodes.
Chapter 7 is the conclusion of our thesis work.


CHAPTER II
PREVIOUS STUDIES

2.1 Big Data
We now live in the era of big data. In this era, the volume of data has exploded. The
magnitude of data generated and shared by businesses, public administrations, numerous
industrial sectors, not-for-profit sectors and scientific research has increased
immeasurably [9]. These data range from textual content (i.e. structured, semi-structured as
well as unstructured) to multimedia content (e.g. videos, images, audio) on a multiplicity
of platforms (e.g. machine-to-machine communications, social media sites, sensor
networks, cyber-physical systems and the Internet of Things [IoT]). Dobre and Xhafa [10]
report that every day the world produces around 2.5 quintillion bytes of data (2.3
trillion gigabytes), with 90% of the data generated in the world being unstructured. It is
asserted that by 2020, over 40 zettabytes (or 40 trillion gigabytes) of data will be generated,
imitated, and consumed. With this overwhelming amount of complex and heterogeneous
data pouring in from anywhere, at any time and from any device, there is undeniably an era of Big
Data, a phenomenon also referred to as the Data Deluge. In essence, Big Data is the
artifact of each human individual as well as of collective intelligence, generated and shared
mainly through technological environments where virtually anything and everything
can be documented, measured and captured digitally, being transformed into data in the
process, which Mayer-Schönberger and Cukier [11] refer to as datafication.
Regardless of where Big Data is generated and shared, with the
reality of Big Data come the challenges of analyzing it in a way that brings Big Value.
Nevertheless, the growth of data volumes in the digital world seems to outpace the
advancement of many existing computing infrastructures. Well-established data
processing technologies, for example databases and data warehouses, are becoming
inadequate in the face of the amount of data the world is going to generate. This massive
amount of data needs to be analyzed in an iterative as well as time-sensitive manner.
The ability to work with datasets at this scale is critical, and traditional
computing approaches, with a single computer using a multicore processor to deal with
a modest amount of data, are not suitable for such massive datasets.
In the post-ImageNet [12] era, computer vision and machine learning researchers are
solving more complicated AI problems using larger datasets which drives the demand for
more computation. However, Moore’s Law is slowing down, Dennard scaling has
stopped, and the amount of computation per unit cost and power is no longer increasing
at its historic rate. This mismatch between supply and demand of computation highlights
the need for co-designing efficient machine learning algorithms and domain-specific
hardware architectures for massive scale datasets. The vast design space across algorithm
and hardware is difficult to explore with the engineered applications and tools available today.
Therefore, we need different architectures with distributed workloads to bridge the gap.

2.2 TensorFlow
Created by the Google Brain team, TensorFlow [13] is an open source library for
numerical computation and large-scale machine learning. TensorFlow bundles together a
slew of machine learning and deep learning (aka neural networking) models and
algorithms and makes them useful by way of a common metaphor. It uses Python to
provide a convenient front-end API for building applications with the framework, while
executing those applications in high-performance C++. TensorFlow can train and run
deep neural networks for handwritten digit classification, image recognition, word
embeddings, sequence-to-sequence models for machine translation, natural language
processing, and PDE (partial differential equation) based simulations. TensorFlow
supports production prediction at scale, with the same models used for training.
2.2.1 Architecture
TensorFlow is a cross-platform library. Figure 1 illustrates its general architecture. The C
API separates user-level code in different languages from the core runtime.

Figure. 1 TensorFlow General Architecture
Client
- Defines the computation as a dataflow graph.
- Initiates graph execution using a session.

Distributed Master
- Prunes a specific subgraph from the graph, as defined by the arguments to session.run().
- Partitions the subgraph into multiple pieces that run in different processes and devices.
- Distributes the graph pieces to worker services.
- Initiates graph piece execution by worker services.

Worker Services (one for each task)
- Schedule the execution of graph operations using kernel implementations appropriate to the available hardware (CPUs, GPUs, etc.).
- Send and receive operation results to and from other worker services.

Kernel Implementations
- Perform the computation for individual graph operations.

2.3 Distributed TensorFlow
TensorFlow is designed for large-scale distributed training and inference, but it is also
flexible enough to support experimentation with new machine learning models and
system-level optimizations at smaller scales.
tf.distribute.Strategy is a TensorFlow API to distribute training across multiple GPUs,
multiple machines or TPUs. Using this API, users can distribute their existing models and
training code with minimal code changes.
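As a minimal, illustrative sketch (not the exact code used in this thesis), the following shows how tf.distribute.MirroredStrategy, one concrete tf.distribute.Strategy, can wrap model construction so that training replicates across the visible GPUs and all-reduces the gradients each step; the LSTM and layer sizes simply mirror the HAR input shape described later in Chapter IV (128 readings per window, 9 signals, 6 activity classes) and are assumptions here.

```python
import tensorflow as tf

# MirroredStrategy replicates the model on every visible GPU and
# synchronizes (all-reduces) gradients after each training step.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Variables created inside the scope are mirrored across devices.
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(32, input_shape=(128, 9)),   # 128 steps, 9 signals
        tf.keras.layers.Dense(6, activation="softmax"),   # 6 activity classes
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# A later call such as model.fit(train_x, train_y, batch_size=256)
# then splits each global batch across the replicas automatically.
```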


Figure. 2 TensorFlow Master Worker Model
Client
Users write the client TensorFlow program that builds the computation graph. This
program can either directly compose individual operations or use a convenience library
like the Estimators API to compose neural network layers and other higher-level
abstractions. TensorFlow supports multiple client languages but has prioritized Python and
C++. The client creates a session, which sends the graph definition to the
distributed master as a tf.GraphDef protocol buffer. When the client evaluates a node or
nodes in the graph, the evaluation triggers a call to the distributed master to initiate
computation. In Figure 3, the client has built a graph that applies weights (w) to a feature
vector (x), adds a bias term (b) and saves the result in a variable (s).
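A hedged sketch of that client-side graph, using the TensorFlow 1.x style API that the distributed master/worker description assumes (the placeholder shapes and values are illustrative, and only the names w, x, b, and s come from Figure 3):

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

# Client builds the graph: apply weights w to a feature vector x,
# add a bias b, and save the result in s.
x = tf.placeholder(tf.float32, shape=[None, 3], name="x")   # feature vector
w = tf.Variable(tf.random_normal([3, 1]), name="w")          # weights
b = tf.Variable(tf.zeros([1]), name="b")                     # bias term
s = tf.add(tf.matmul(x, w), b, name="s")                     # result

# Creating a session and evaluating s triggers the call to the
# distributed master described above.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(s, feed_dict={x: [[1.0, 2.0, 3.0]]}))
```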
The Distributed Master
The master prunes the graph to obtain the subgraph required to evaluate the nodes
requested by the client, then partitions the graph to obtain graph pieces for each
participating device, and caches these pieces so that they may be re-used in subsequent
steps.
Since the master sees the overall computation for a step, it applies standard optimizations
such as common subexpression elimination and constant folding. It then coordinates
execution of the optimized subgraphs across a set of tasks.

Figure. 3 Distributed Master workflow
Worker Service
The worker service in each task handles requests from the master, schedules the
execution of the kernels for the operations that comprise a local subgraph, and mediates
direct communication between tasks. TensorFlow optimizes the worker service for
running large graphs with low overhead. The current implementation can execute tens of
thousands of subgraphs per second, which enables a large number of replicas to make
rapid, fine-grained training steps. The worker service dispatches kernels to local devices
and runs kernels in parallel when possible, for example by using multiple CPU cores or
GPU streams.

TensorFlow specializes its Send and Recv operations for each pair of source and destination
device types. Transfers between local CPU and GPU devices use the
cudaMemcpyAsync() API to overlap computation and data transfer. Transfers between
two local GPUs use peer-to-peer DMA to avoid an expensive copy via the host CPU. For
this reason, working on GPUs with TensorFlow is much faster than on CPUs. For
transfers between tasks, TensorFlow uses multiple protocols, including gRPC over TCP and
RDMA over Converged Ethernet.
TensorFlow has preliminary support for NVIDIA's NCCL library for multi-GPU
communication. The supported API is tf.contrib.nccl.

Figure. 4 Nvidia MultiGPU NCCL
The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and
multi-node collective communication primitives that are performance-optimized for
NVIDIA GPUs. NCCL provides routines such as all-gather, all-reduce, broadcast,
reduce, and reduce-scatter that are optimized to achieve high bandwidth over PCIe and
the NVLink high-speed interconnect. Figure 4 illustrates NCCL communication.
Kernel Implementations
The runtime contains over 200 standard operations, including mathematical, array
manipulation, control flow, and state management operations. Each of these operations
can have kernel implementations optimized for a variety of devices. In many of the
operations, kernels are implemented using Eigen::Tensor, which uses C++ templates to
generate efficient parallel code for multicore CPUs and GPUs; TensorFlow uses libraries
like cuDNN where a more efficient kernel implementation is possible. TensorFlow
implements quantization, which enables faster inference in environments such as mobile
devices and high-throughput datacenter applications, and uses the gemmlowp low-precision
matrix library to accelerate quantized computation. (gemmlowp is a library for
multiplying matrices whose entries are quantized as 8-bit integers. It is used in mobile
neural network applications. It has received contributions from Intel and ARM, ensuring
that it is efficient on various mobile CPUs.)
If it is difficult or inefficient to represent a subcomputation as a composition of
operations, users can register additional kernels that provide an efficient implementation
written in C++. For better performance, TensorFlow recommends registering your own fused
kernels for some performance-critical operations, such as the ReLU and Sigmoid
activation functions and their corresponding gradients. The XLA Compiler has an
experimental implementation of automatic kernel fusion.
TensorFlow provides eager execution mode for developers who need to debug and gain
introspection into TensorFlow apps, which lets you evaluate and modify each graph
operation separately and transparently, instead of constructing the entire graph as a single
opaque object and evaluating it all at once. The TensorBoard visualization suite lets the
developer inspect and customize the graphs by way of an interactive, web-based
dashboard.


2.4 PyTorch
PyTorch [14] is a Python open source deep learning framework that was primarily
developed by Facebook’s artificial intelligence research group and was publicly
introduced in January 2017.
Building Block #1: Tensors
PyTorch provides a basic data structure called a Tensor, which is very similar to
NumPy’s ndarray. But unlike the latter, tensors can tap into the resources of a GPU to
significantly speed up matrix operations.
Building Block #2: Computation Graph
When a neural network is trained, researchers need to compute gradients of the loss
function, with respect to every weight and bias, and then update these weights using
gradient descent. With neural networks hitting billions of weights, doing the above step
efficiently can make or break the feasibility of training.
In PyTorch, the computation graph is simply a data structure that allows to efficiently
apply the chain rule to compute gradients for all of your parameters.
Building Block #3: Variables and Autograd
A Variable, just like a Tensor, is a class used to hold data. Variables are
specifically tailored to hold values which change during training of a neural network, i.e.
the learnable parameters of the network. Tensors, on the other hand, are used to store
values that are not to be learned. For example, a Tensor may be used to store the values of
the loss generated by each example.
The graph is differentiated using the chain rule. If any of the tensors are non-scalar (i.e. their
data has more than one element) and require a gradient, the backward function additionally
requires specifying grad_tensors, a sequence of matching length containing the
gradient of the differentiated function with respect to the corresponding tensors.

Building Block #4: Function
PyTorch abstracts the need to write two separate functions (for the forward and
backward passes) into two member functions of a single class called
torch.autograd.Function.
PyTorch combines Variables and Functions to create a computation graph.
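As an illustrative sketch only (in recent PyTorch releases the Variable API has been folded into Tensor, so requires_grad on a plain tensor plays the role described above), the following shows the graph being built dynamically as operations run and then differentiated with the chain rule:

```python
import torch

# Learnable parameters ask for gradients; plain inputs do not.
w = torch.randn(3, requires_grad=True)     # weights (learnable)
b = torch.tensor(0.5, requires_grad=True)  # bias (learnable)
x = torch.tensor([1.0, 2.0, 3.0])          # input data, no gradient needed

# The computation graph is recorded as these operations execute.
loss = ((w * x).sum() + b) ** 2

# backward() applies the chain rule over the recorded graph.
loss.backward()
print(w.grad, b.grad)                      # gradients w.r.t. each parameter
```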
Dynamic Computation Graphs
A Dynamic Computational Graph framework is a system of libraries, interfaces, and
components that provide a flexible, programmatic, run time interface that facilitates the
construction and modification of systems by connecting operations. PyTorch creates the
runtime dynamic computation graphs.
To qualify as a Dynamic Computational Graph framework, the framework must merely
support the deferring of the determination of algorithm to run time, therefore opening the
door to a plethora of operations on the computational dependencies and data flow at run
time. The basics of the operations deferred must include the specification, manipulation,
execution, and storage of the directed graphs that represent systems of operations.

The advantages of dynamic computational graphs include the ability to adapt
to varying quantities of input data. It also opens the door to selecting the
number of layers, the number of neurons in each layer, the activation function, and other
neural network parameters dynamically, depending on each input instance during training.

2.5 MXNet
MXNet [15] is a deep learning framework from the Apache Software Foundation that supports a plethora
of languages, such as Python, Julia, C++, R, and JavaScript. It has been adopted by Microsoft,
Intel, and Amazon Web Services.

The MXNet framework is known for its great scalability, which is used by large
companies mainly for speech and handwriting recognition, NLP, and forecasting.


CHAPTER III
RELATED WORKS
3.1 Distributed GraphLab Framework
There are several distributed machine learning frameworks available today. The
high-level data-parallel frameworks, like MapReduce, simplify the design and
implementation of large-scale data processing systems, but they do not efficiently support
many important data mining and machine learning algorithms and can lead to inefficient
learning systems. To fill this critical void, the GraphLab abstraction was introduced, which
naturally expresses asynchronous, dynamic, graph-parallel computations while ensuring
data consistency and achieving a high degree of parallel performance in the shared-memory
setting [16].
Turi is a graph-based, high performance, distributed computation framework written in
C++. The GraphLab project was started by Prof. Carlos Guestrin of Carnegie Mellon
University in 2009. It is an open source project using an Apache License. While
GraphLab was originally developed for Machine Learning tasks, it has found great
success at a broad range of other data-mining tasks; out-performing other abstractions by
orders of magnitude. [17]
As the amount of collected data and computing power grows (multicores, GPUs,
clusters, clouds), modern datasets no longer fit into one computing node. Efficient
distributed/parallel algorithms for handling large scale datasets are required. The
GraphLab framework is a parallel programming abstraction targeted at sparse iterative
graph algorithms. GraphLab provides a high-level programming interface, allowing
rapid deployment of distributed machine learning algorithms [18]. The main design
considerations behind GraphLab are sparse data with local dependencies,
iterative algorithms, and potentially asynchronous execution.
The main features of GraphLab are:
- A unified multicore and distributed API: write once, run efficiently in both shared-memory and distributed-memory systems.
- Tuned for performance: an optimized C++ execution engine leverages extensive multi-threading and asynchronous IO.
- Scalable: GraphLab intelligently places data and computation using sophisticated new algorithms.
- HDFS integration.
- Powerful machine learning toolkits.

The GraphLab framework has been extended to the substantially more challenging distributed setting
while preserving strong data consistency guarantees. The graph-based
extensions use pipelined locking and data versioning to reduce network congestion
and mitigate the effect of network latency. Fault tolerance is introduced into the
GraphLab abstraction using the classic Chandy-Lamport snapshot algorithm, and the authors demonstrate
how easily it can be implemented by exploiting the GraphLab abstraction itself.

3.2 Parameter Server Framework
The parameter server is designed to simplify the development of distributed machine learning
applications [8]. An instance of the parameter server can run more
than one algorithm simultaneously. Parameter server nodes are grouped into a server
group and several worker groups, as shown in Figure 5. A server node in the server group
maintains a partition of the globally shared parameters. Server nodes communicate with
each other to replicate and/or to migrate parameters for reliability and scaling. A server
manager node maintains a consistent view of the metadata of the servers, such as node
liveness and the assignment of parameter partitions. Each worker group runs an
application. A worker typically stores locally a portion of the training data to compute
local statistics such as gradients. Workers communicate only with the server nodes (not
among themselves), updating and retrieving the shared parameters. There is a scheduler
node for each worker group. It assigns tasks to workers and monitors their progress. If
workers are added or removed, it reschedules unfinished tasks.
The parameter server supports independent parameter namespaces. This allows a worker
group to isolate its set of shared parameters from others. Several worker groups may also
share the same namespace: we may use more than one worker group to solve the same
deep learning application [19] to increase parallelization. Another example is that of a
model being actively queried by some nodes, such as online services consuming this
model. Simultaneously the model is updated by a different group of worker nodes as new
training data arrives.
The shared parameters are presented as (key, value) vectors to facilitate linear algebra
operations. They are distributed across a group of server nodes. Any node can both push
out its local parameters and pull parameters from remote nodes. By default, workloads, or
tasks, are executed by worker nodes; however, they can also be assigned to server nodes
via user defined functions. Tasks are asynchronous and run in parallel. The parameter
server provides the algorithm designer with flexibility in choosing a consistency model
via the task dependency graph and predicates to communicate a subset of parameters.
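To make the push/pull interface concrete, here is a toy in-process sketch of a key-value parameter server; real systems shard the keys across many server nodes and issue the calls as asynchronous RPCs, and the class, method, and parameter names here are illustrative rather than taken from the thesis code.

```python
# Toy sketch of the (key, value) push/pull interface of a parameter server.
class ParameterServer:
    def __init__(self):
        self.store = {}                          # key -> parameter vector

    def push(self, key, grad, lr=0.01):
        # A worker pushes its local gradient; the server applies the update.
        w = self.store.setdefault(key, [0.0] * len(grad))
        self.store[key] = [wi - lr * gi for wi, gi in zip(w, grad)]

    def pull(self, key):
        # A worker pulls the latest shared parameters.
        return list(self.store[key])

ps = ParameterServer()
ps.push("layer1/weights", [0.2, -0.1, 0.4])      # worker sends a gradient
latest = ps.pull("layer1/weights")               # worker fetches fresh weights
```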


Figure. 5 Parameter Server Framework

3.2.1 Distributed Synchronous Stochastic Gradient Descent

Figure. 6 Distributed SGD

In Figure 6, gradients are calculated independently on each worker node.
In a real scenario, each training node performs the forward-backward pass on different
batches sampled from the training dataset with the same network model. The gradients
from all nodes are summed up to optimize their models. By this synchronization step,
models on different nodes are always the same during the training. The aggregation step
can be achieved in two ways. One method is using the parameter servers as the
intermediary which store the parameters among several servers [20]. The nodes push the
gradients to the servers while the servers are waiting for the gradients from all nodes.
Once all gradients are sent, the servers update the parameters, and then all nodes pull the
latest parameters from the servers.
One major disadvantage is network bandwidth. Large-scale distributed training improves
the productivity of training deeper and larger models (Chilimbi et al., 2014; Xing et al.,
2015; Moritz et al., 2015; Zinkevich et al., 2010). Synchronous stochastic gradient
descent (SGD) is widely used for distributed training. By increasing the number of
training nodes and taking advantage of data parallelism, the total computation time of the
forward-backward passes on the same size training data can be dramatically reduced.
However, gradient exchange is costly and dwarfs the savings of computation time (Li et
al., 2014; Wen et al., 2017), especially for recurrent neural networks (RNN) where the
computation-to-communication ratio is low. Therefore, the network bandwidth becomes
a significant bottleneck for scaling up distributed training. [21]


Algorithm 1. Distributed Subgradient Descent

As shown in Algorithm 1, the training data is partitioned among all the workers, which
jointly learn the parameter vector w. Because each worker works independently, the
system expresses the updates as subgradients (directions in which the parameter vector w
should be shifted) and aggregates all subgradients before applying them to w. Data is sent
between nodes using push and pull operations. A task is issued by a remote procedure
call. It can be a push or a pull that a worker issues to the servers, or a user-defined
function that the scheduler issues to any node. Tasks may include any number of
subtasks. Tasks are executed asynchronously. In Algorithm 1, a worker pushes its
temporary local gradient g to the parameter server for aggregation. The most expensive
step in Algorithm 1 is computing the subgradient to update w. This task is divided among
all of the workers, each of which executes WORKERITERATE. The task
WORKERITERATE in Algorithm 1 contains one push and one pull: each worker pushes
its entire local gradient to the servers and then pulls the updated weight back. The
aggregation logic in SERVERITERATE updates the weight w only after all worker
gradients have been aggregated.
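The following toy NumPy sketch mirrors the structure of Algorithm 1 for a simple least-squares model: the push and pull RPCs are simulated by ordinary function calls, and the function names, data shapes, and learning rate are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def worker_iterate(w, x_batch, y_batch):
    """WORKERITERATE: compute the local (sub)gradient on this worker's shard."""
    preds = x_batch @ w
    grad = x_batch.T @ (preds - y_batch) / len(y_batch)
    return grad                                  # this is what gets pushed

def server_iterate(w, grads, lr=0.01):
    """SERVERITERATE: aggregate only after every worker has pushed, then update."""
    g = np.mean(grads, axis=0)
    return w - lr * g                            # workers then pull the new w

# One synchronous step with two simulated workers on partitioned data:
w = np.zeros(3)
shards = [(np.random.randn(8, 3), np.random.randn(8)) for _ in range(2)]
grads = [worker_iterate(w, x, y) for x, y in shards]   # push
w = server_iterate(w, grads)                            # aggregate, then pull
```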

3.3 Deep Gradient Compression

Figure. 7 Deep Gradient Compression
Deep Gradient Compression (DGC) solves the communication bandwidth problem by
compressing the gradients, as shown in Figure 7. To ensure no loss of accuracy, DGC
employs momentum correction and local gradient clipping on top of the gradient
sparsification to maintain model performance. DGC also uses momentum factor masking
and warmup training to overcome the staleness problem caused by reduced
communication. [21]

Techniques in Deep Gradient Compression
Gradient Sparsification
Reduce the communication bandwidth by sending only the important gradients. The
gradient magnitude is used as a simple heuristic for importance: only gradients larger
than a threshold (the top 0.01%) are transmitted.
Local Gradient Accumulation
Gradient accumulation algorithms represent an important component of distributed
training systems. These algorithms are responsible for accumulating the local gradients
from each worker node and distributing the updated global gradients back to the worker
nodes. The all-reduce algorithm is a very good fit for this functionality and also removes
the need for a master server by espousing a peer-to-peer paradigm for data exchange.
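A rough PyTorch-style sketch of how gradient sparsification and local gradient accumulation fit together is shown below; it omits DGC's momentum correction and factor masking, the threshold selection is simplified, and the function name and ratio argument are illustrative assumptions rather than the paper's or thesis's code.

```python
import torch

def sparsify(grad, residual, ratio=0.0001):
    """Transmit only the largest-magnitude entries; accumulate the rest locally."""
    acc = residual + grad                        # local gradient accumulation
    k = max(1, int(acc.numel() * ratio))         # e.g. keep roughly the top 0.01%
    threshold = acc.abs().flatten().topk(k).values.min()
    mask = acc.abs() >= threshold
    sparse_update = acc * mask                   # exchanged with the other nodes
    new_residual = acc * (~mask)                 # kept for the next iteration
    return sparse_update, new_residual
```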
Local Gradient Clipping
Gradient clipping is widely adopted to avoid the exploding gradient problem [22]. The
method proposed by Pascanu et al. (2013) rescales the gradients whenever the sum of
their L2-norms exceeds a threshold. This step is conventionally executed after gradient
aggregation from all nodes. In DGC, the accumulation of gradients over iterations on
each node is performed independently, and the gradient clipping is performed locally
before adding the current gradient G_t to the previous accumulation (G_{t-1} in
Algorithm 2) [23].
Momentum Factor Masking
Mitliagkas et al. (2016) discussed the staleness caused by asynchrony and described it as
implicit momentum. Inspired by that, DGC introduces momentum factor masking to
alleviate staleness. Instead of searching for a new momentum coefficient as suggested in
Mitliagkas et al. (2016) [24], it simply applies the same mask to the accumulated
gradients. This mask stops the momentum for delayed gradients, preventing the stale
momentum from carrying the weights in the wrong direction.

Algorithm 2. All-reduce Algorithm with local gradient clipping

When training the recurrent neural network with gradient clipping, the clipping is
performed locally before adding the current gradient G^k_t to the previous accumulation
G^k_{t-1} in Algorithm 2.


CHAPTER IV
HUMAN ACTIVITY RECOGNITION
USING LSTM

4.1 HUMAN ACTIVITY RECOGNITION
Human Activity Recognition (HAR) is a broad field of study concerned with the
ability to interpret human body gestures or motion via sensors and determine human
activity or action [25]. Many daily human tasks can be simplified or automated if
they can be recognized via a HAR system. Typically, a HAR system can be either supervised
or unsupervised [26]. A supervised HAR system requires some prior training with
dedicated datasets, while an unsupervised HAR system is configured with a set of rules
during development. HAR is considered an important component in various scientific
research contexts, e.g. surveillance, healthcare and human computer interaction (HCI).
However, it remains a very complex task, due to still-unsolved challenges such as sensor
motion, sensor placement, cluttered backgrounds, and the inherent variability in the way
activities are conducted by different humans. HAR covers three areas of sensing
technology, namely RGB cameras, depth sensors and wearable devices. The popularity
of depth sensors and wearable devices in HAR research is well established.

4.1.1. Surveillance System
In the surveillance context, HAR has been adopted in surveillance systems installed at public
places, e.g. shopping malls or airports, which introduced a new paradigm of human
activity prediction to prevent crimes and dangerous activities from occurring at public
places. Lasecki et al. proposed a system that provides robust, deployable activity
recognition by supplementing existing recognition systems with on-demand, real-time
activity identification using inputs from the crowds at public places [27].
4.1.2. Healthcare
In the field of healthcare, HAR is employed in healthcare systems which are
installed in residential environments, hospitals and rehabilitation centers. HAR is used
widely for monitoring the activities of elderly people staying in rehabilitation centers for
chronic disease management and disease prevention [28]. HAR is also integrated into
smart homes for tracking the elderly people’s daily activities [29]. Besides, HAR is used
to encourage physical exercises in rehabilitation centers for children with motor
disabilities [30], post-stroke motor patients, patients with dysfunction and psycho motor
slowing, and exergaming [31]. Other than that, the HAR is adopted in monitoring patients
at home such as estimation of energy expenditure to aid in obesity prevention and
treatment and life logging. HAR is also applied in monitoring other behaviors such as
stereotypical motion conditions in children with Autism Spectrum Disorders (ASD) at
home, abnormal conditions for cardiac patients and detection of early signs of illness.
Other healthcare related HAR solutions such as fall detection and intervention for elderly
people are available [32].

4.1.3. Human Computer Interaction
In the field of human computer interaction, HAR has been applied quite
commonly in gaming and exergaming, such as with the Kinect, Nintendo Wii and full-body
motion based games for older adults and adults with neurological injury [33]. Through HAR,
human body gestures are recognized to instruct the machine to complete dedicated tasks.
Elderly people and adults with neurological injury can perform a simple gesture to
interact with games and exergames easily. HAR also enables surgeons to have intangible
control of the intraoperative image monitor by using standardized free-hand movements
[34].

4.1.4 HAR Sensing Technologies
Recognizing human activity using an RGB camera is simple but has low
efficiency. An RGB camera is usually attached to the environment, and the HAR system
processes image sequences captured with the camera. Most of the conventional HAR
systems using this sensing technology are built with two major components, namely
feature extraction and classification [35]. Besides, most RGB-HAR systems are
considered supervised systems, where training is usually needed prior to actual use.
Image sequences and the names of human activities are fed into the system during the
training stage. Image sequences captured in real time are then passed to the system for
analysis and classification by dedicated computational/classification algorithms such as
the Support Vector Machine (SVM).
The depth sensor, also known as an infrared sensor or infrared camera, has been adopted into
HAR systems for recognizing human activities. The depth sensor projects infrared beams
into the scene and recaptures them using its infrared sensor to calculate and measure the
depth or distance of each beam from the sensor. Reviews found that the Microsoft
Kinect sensor is commonly adopted as the depth sensor in HAR [33]. Since the Kinect sensor
has the capability to detect 20 human body joints with their real-world coordinates, many
researchers utilize these coordinates for human activity classification.
Wearable-based HAR requires one or more sensors to be attached to
the human body. The most commonly used sensors include the 3-axial accelerometer,
magnetometer, gyroscope and RFID tag. With the advancement of current smartphone
technologies, many research works use mobile phones as sensing devices because most
smartphones are equipped with an accelerometer, magnetometer and gyroscope [36]. A
physical human activity can be identified easily by analyzing the data generated from the
various wearable sensors after it has been processed and classified by a classification algorithm.

4.2 Dataset (UCI Repository)

4.2.1 Data Set Information
The dataset named “Human Activity Recognition Using Smartphones Data Set” [37] is
used from UCI repository in this thesis. The experiments have been carried out with a
group of 30 volunteers within an age bracket of 19-48 years. Each person performed six
activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS,
SITTING, STANDING, LAYING) wearing a smartphone (Samsung Galaxy S II) on the
waist. Using its embedded accelerometer and gyroscope, they captured 3-axial linear
acceleration and 3-axial angular velocity at a constant rate of 50Hz. The experiments
have been video-recorded to label the data manually. The obtained dataset has been
randomly partitioned into two sets, where 70% of the volunteers were selected for
generating the training data and 30% for the test data.
The sensor signals (accelerometer and gyroscope) were pre-processed by applying noise
filters and then sampled in fixed-width sliding windows of 2.56 sec and 50% overlap
(128 readings/window). The sensor acceleration signal, which has gravitational and body
motion components, was separated using a Butterworth low-pass filter into body
acceleration and gravity. The gravitational force is assumed to have only low frequency
components, therefore a filter with 0.3 Hz cutoff frequency was used. From each
window, a vector of features was obtained by calculating variables from the time and
frequency domain.
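The preprocessing just described can be sketched with SciPy as follows; the filter order and the helper names are illustrative assumptions, while the 0.3 Hz cutoff, 50 Hz sampling rate, and 2.56 s windows with 50% overlap (128 readings) come from the description above.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 50.0            # sampling rate in Hz
WINDOW = 128         # 2.56 s * 50 Hz readings per window
STEP = WINDOW // 2   # 50% overlap between consecutive windows

def split_gravity(total_acc, cutoff=0.3, order=3):
    """Separate gravity from body acceleration with a low-pass Butterworth filter."""
    b, a = butter(order, cutoff / (FS / 2), btype="low")
    gravity = filtfilt(b, a, total_acc, axis=0)
    return total_acc - gravity, gravity          # body acceleration, gravity

def sliding_windows(signal):
    """Cut a signal into fixed-width, 50%-overlapping windows."""
    return np.stack([signal[i:i + WINDOW]
                     for i in range(0, len(signal) - WINDOW + 1, STEP)])
```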

4.2.2 Attribute Information
For each record in the dataset it is provided:
→ Triaxial acceleration from the accelerometer (total acceleration) and the estimated
body acceleration.
→ Triaxial Angular velocity from the gyroscope.
→ A 561-feature vector with time and frequency domain variables.
→ Its activity label.
→ An identifier of the subject who carried out the experiment.
4.2.3 Feature Notes
→ Features are normalized and bounded within [-1, 1].
→ Each feature vector is a row on the text file.
→ The units used for the accelerations (total and body) are 'g's (gravity of earth = 9.80665 m/sec^2).
→ The gyroscope units are rad/sec.


The file structure inside the dataset is described in Table 1.

activity_labels.txt: Links the class labels with their activity name.
features_info.txt: Shows information about the variables used on the feature vector.
features.txt: List of all features.
README.txt: Information about dataset details.
test/X_test.txt: Test set.
test/y_test.txt: Test labels.
train/X_train.txt: Training set.
train/y_train.txt: Training labels.
Inertial Signals/body_acc_x_train.txt, body_acc_y_train.txt, body_acc_z_train.txt: The body acceleration signal obtained by subtracting the gravity from the total acceleration. Every row shows a 128-element vector; the X, Y and Z files hold the corresponding axes.
Inertial Signals/body_gyro_x_train.txt, body_gyro_y_train.txt, body_gyro_z_train.txt: The angular velocity vector measured by the gyroscope for each window sample, in radians/second. Every row shows a 128-element vector; the X, Y and Z files hold the corresponding axes.
Inertial Signals/total_acc_x_train.txt, total_acc_y_train.txt, total_acc_z_train.txt: The acceleration signal from the smartphone accelerometer in standard gravity units 'g'. Every row shows a 128-element vector; the X, Y and Z files hold the corresponding axes.

Table. 1 Dataset Feature Parameters


Figure. 8 shows the hierarchy of the file structure inside the dataset.

Figure. 8 Dataset File Structure

4.3 LSTM
The LSTM network, proposed by Hochreiter and Schmidhuber in 1997 [4], is a variant of
recurrent neural networks (RNNs). It has special inner gates that allow it to consistently
outperform standard RNNs on time series. Compared with other networks, such as
CNN, restricted Boltzmann machine (RBM) and auto-encoder (AE), the structure of the
LSTM renders it especially good at solving problems involving time series, such as those
related to natural language processing, speech recognition, and weather prediction,
because its design enables gradients to flow through time readily.

4.3.1 Why LSTM?
4.3.1.1 CNN
The basic difference between a feed forward neuron and a recurrent neuron is shown in
Figure 9.

Figure. 9 Basic Feed-Forward and Recurrent cell


The feed-forward neuron has two weights, which connect its input to its output. The
recurrent neuron also has a connection from its output back to its input and therefore
has three weights. This third, extra connection is called the feedback connection, and
through it the activation can flow around in a loop. When many feed-forward layers are
connected together, they form a Convolutional Neural Network (CNN).
When many feed-forward and recurrent neurons are connected, they form a Recurrent
Neural Network (RNN). The major difference between the two is that a CNN is a
feed-forward neural network, while an RNN is recurrent: in a CNN the information only
flows in the forward direction, while in an RNN the information flows back and forth.
In mathematics, a convolution is a grouping function. In CNNs, convolution happens
between two matrices (rectangular arrays of numbers arranged in columns and rows) to
form a third matrix as an output. A CNN uses these convolutions in the convolutional
layers to filter input data and find information.
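To make the convolution operation concrete, the short Python sketch below (using NumPy, which is an assumption of this illustration and not part of the thesis code) slides a 3 x 3 filter over a small matrix and produces the output feature map; the array values are made up for illustration.

import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and
    return the resulting feature map."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # element-wise multiply the covered patch and sum it up
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5 x 5 input
edge_filter = np.array([[1., 0., -1.],             # toy 3 x 3 filter
                        [1., 0., -1.],
                        [1., 0., -1.]])
print(conv2d_valid(image, edge_filter))            # 3 x 3 feature map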
The University of Toronto researchers Alex Krizhevsky, Ilya Sutskever and Geoffrey
Hinton trained a deep convolutional neural network to classify the 1.2 million images
from the ImageNet Large Scale Visual Recognition Challenge contest, winning with a
record-breaking reduction in error rate [12]. This sparked today’s modern AI boom.
The convolutional layer does most of the computational work in a CNN. It acts as a set of
mathematical filters that help computers find edges of images, dark and light areas,
colors, and other details, such as height, width and depth.
There are usually many convolutional layer filters applied to an image.
Pooling layer: Pooling layers are often sandwiched between the convolutional layers.
They are used to reduce the size of the representations created by the CNN and to reduce the
memory requirements, which allows for more convolutional layers.

Normalization layer: Normalization is a technique used to improve the performance and
stability of neural networks. There are different types of normalization available in CNN.
Those are Weight Normalization [38], Layer Normalization [39], and Batch
Normalization [40].
Fully connected layers: Fully connected layers connect every neuron in one layer to
every neuron in another layer. They use the same principle as the traditional multi-layer
perceptron neural network (MLP). The flattened matrix goes through a fully connected
layer to classify the images.
Back propagation is then used to calculate the gradients of the error with respect to all
the weights in the network. Back propagation is the method by which a neural network is
trained. It does not have much to do with the structure of the network, but rather describes
how the weights are updated. When training a feed-forward network, the information is
passed into the network, and the resulting classification is compared to the known
training sample. If the network's classification is incorrect, the weights are adjusted
backward through the network in the direction that would give it the correct
classification. This is called the backward propagation of the training. So a CNN is a
feed-forward network, but it is trained through back propagation.
CNNs are ideally suited for computer vision, but given enough data they can also be
useful for video, speech, music and text.
4.3.1.2 Back Propagation
Algorithm 3. Back Propagation algorithm.
Consider a network with a single real input x and network function P. The derivative
P'(x) is computed in two phases: (1) Feed-forward: the input x is fed into the network.
The primitive functions at the nodes and their derivatives are evaluated at each node, and
the derivatives are stored. (2) Back propagation: the constant 1 is fed into the output unit
and the network is run backwards. Incoming information to a node is added and the result
is multiplied by the value stored in the left part of the unit. The result is transmitted to the
left of the unit. The result collected at the input unit is the derivative of the network
function with respect to x.
Back propagation is based around four fundamental equations. Together, those equations
give us a way of computing both the error δl and the gradient of the cost function. The
four equations are shown below [41].
An equation for the error in the output layer, δL: the components of δL are given by
δLj = (∂C/∂aLj) σ′(zLj)    (BP1)
This is a very natural expression. The first term on the right, ∂C/∂aLj, just measures how
fast the cost is changing as a function of the jth output activation. If, for example, C
doesn't depend much on a particular output neuron j, then δLj will be small, which is as
expected. The second term on the right, σ′(zLj), measures how fast the activation function
σ is changing at zLj.
An equation for the error δl in terms of the error in the next layer, δl+1:
δl = ((wl+1)T δl+1) ⊙ σ′(zl)    (BP2)
where (wl+1)T is the transpose of the weight matrix wl+1 for the (l+1)th layer. When we
apply the transpose weight matrix (wl+1)T, we can think of it as moving the error
backward through the network, which gives some sort of measure of the error at the
output of the lth layer. Taking the elementwise product with σ′(zl) then moves the error
backward through the activation function in layer l, giving the error δl in the weighted
input to layer l.

By combining (BP2) with (BP1), the error δl can be computed for any layer in the
network.
An equation for the rate of change of the cost with respect to any bias in the
network:
∂C/∂blj = δlj    (BP3)
That is, the error δlj is exactly equal to the rate of change ∂C/∂blj, which can be computed
using (BP1) and (BP2). We can rewrite (BP3) in shorthand as ∂C/∂b = δ, where δ is
evaluated at the same neuron as the bias b.

An equation for the rate of change of the cost with respect to any weight in the
network:
∂C/∂wljk = al-1k δlj    (BP4)
This tells us how to compute the partial derivatives ∂C/∂wljk in terms of the quantities δl
and al-1. The equation can be rewritten in shorthand as
∂C/∂w = ain δout
where ain is the activation of the neuron input to the weight w, and δout is
the error of the neuron output from the weight w.
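The following is a minimal NumPy sketch of the four equations for a toy two-layer sigmoid network, assuming the quadratic cost C = ½||a − y||²; the layer sizes and random values are illustrative only and are not the thesis's network.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

# Toy 2-layer network (sizes are made up for illustration).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))    # hidden layer
W2, b2 = rng.normal(size=(2, 4)), np.zeros((2, 1))    # output layer
x = rng.normal(size=(3, 1))
y = np.array([[0.0], [1.0]])                          # target

# Feed-forward, storing the weighted inputs z and activations a.
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

# (BP1) output error for the quadratic cost: dC/da = (a2 - y)
delta2 = (a2 - y) * sigmoid_prime(z2)
# (BP2) propagate the error one layer back through the weights
delta1 = (W2.T @ delta2) * sigmoid_prime(z1)
# (BP3) gradient w.r.t. the biases equals the layer error
dC_db2, dC_db1 = delta2, delta1
# (BP4) gradient w.r.t. the weights: incoming activation times outgoing error
dC_dW2 = delta2 @ a1.T
dC_dW1 = delta1 @ x.T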

If we look at the weight w, and the two neurons connected by that weight, we can depict
this as:

Figure. 10 Two Connected Neurons with weights
The above back propagation rules are summarized in Figure 11.

Figure.11 Back Propagation Rule


Figure. 12 Convolution Neural Network
The overall training process of the Convolution Network may be summarized as below
(see the training-loop sketch after this list):
Step 1: We initialize all filters and parameters / weights with random values.
Step 2: The network takes a training image as input, goes through the forward
propagation step (convolution, ReLU and pooling operations along with forward
propagation in the Fully Connected layer) and finds the output probabilities for each
class.
→ Let's say the output probabilities for the boat image above are [0.2, 0.4, 0.1, 0.3].
→ Since weights are randomly assigned for the first training example, the output
probabilities are also random.
Step 3: Calculate the total error at the output layer (summation over all 4 classes):
Total Error = ∑ ½ (target probability − output probability)²
Step 4: Use back propagation to calculate the gradients of the error with respect to
all weights in the network and use gradient descent to update all filter values /
weights and parameter values to minimize the output error.
→ The weights are adjusted in proportion to their contribution to the total error.
→ When the same image is input again, the output probabilities might now be
[0.1, 0.1, 0.7, 0.1], which is closer to the target vector [0, 0, 1, 0].
→ This means that the network has learnt to classify this particular image correctly
by adjusting its weights / filters such that the output error is reduced.
→ Parameters like the number of filters, the filter sizes, the architecture of the network,
etc. are all fixed before Step 1 and do not change during the training process; only
the values of the filter matrices and connection weights get updated.
Step 5: Repeat steps 2-4 with all images in the training set.
The CNN has now been optimized to correctly classify images from the training set.
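A minimal PyTorch sketch of Steps 1-5 is given below; the layer sizes, the 4-class toy batch and the use of the idiomatic cross-entropy loss (instead of the squared-error formula of Step 3) are assumptions of this illustration, not the thesis's implementation.

import torch
import torch.nn as nn

# Step 1: a toy 4-class CNN whose filters/weights are initialized randomly.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(8 * 16 * 16, 4),            # fully connected classifier
)
loss_fn = nn.CrossEntropyLoss()            # Step 3: error at the output layer
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.randn(8, 3, 32, 32)         # Step 2: a batch of training images
labels = torch.randint(0, 4, (8,))

for epoch in range(5):                     # Step 5: repeat over the training set
    optimizer.zero_grad()
    outputs = model(images)                # forward propagation
    loss = loss_fn(outputs, labels)        # total error
    loss.backward()                        # Step 4: backpropagate the gradients
    optimizer.step()                       # gradient descent update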
4.3.1.3 RNN
The major limitation of CNNs is that they accept a fixed-sized vector as input and produce
a fixed-sized vector as output, namely the probabilities of the different classes. These
models perform the mapping using a fixed number of computational steps, i.e. the number
of layers in the model. They are arranged as a giant sequence of filters or neurons in the
hidden layers that all optimize toward efficiency in identifying an image. Therefore,
CNNs are called "feed-forward" neural networks because information is fed from one
layer to the next. An RNN, however, is trained to recognize patterns across time, while a
CNN learns to recognize patterns across space; hence a CNN learns to recognize
components in an image like lines, edges, curves, etc.


RNN offers two major advantages:
Store Information
The recurrent network can use the feedback connections to store information over time in
the form of activations. This ability is significant for many applications; in this sense,
recurrent networks have a form of memory.
Learn Sequential Data
The main reason for using RNNs is that they allow us to operate over sequences of vectors. In
Figure. 13, with the RNN approach, one-to-many, many-to-one and many-to-many mappings
from inputs to outputs are possible.

Figure. 13 RNN Sequential Data Learning Approach
In Figure. 13, each rectangle is a vector and arrows represent functions (e.g. matrix
multiply). Input vectors are in red, output vectors are in blue and green vectors hold the
RNN's state. The RNN can handle sequential data of arbitrary length. From left to right as
shown in Figure 13: (1) On the left the default feed-forward CNN is shown, which can
just compute from a fixed-sized input to a fixed-sized output (e.g. image classification). (2)
Sequence output (e.g. image captioning takes an image and outputs a sentence of words).
(3) Sequence input (e.g. sentiment analysis where a given sentence is classified as
expressing positive or negative sentiment). (4) Sequence input and sequence output (e.g.
machine translation: an RNN reads a sentence in English and then outputs a sentence in
French). (5) Synced sequence input and output (e.g. video classification where we wish to
label each frame of the video).
Notice that in every case there are no pre-specified constraints on the lengths of the
sequences, because the recurrent transformation (green) is fixed and can be applied as
many times as required.
Recurrent neural networks (RNNs) are connectionist models that capture the
dynamics of sequences via cycles in the network of nodes. Unlike standard CNNs, RNNs
retain a state that can represent information from an arbitrarily long context window.
RNNs combine the input vector with their state vector with a fixed (but learned) function
to produce a new state vector. All recurrent neural networks have the form of a chain of
repeating modules of neural network as shown in Figure 14. In standard RNNs, this
repeating module will have a very simple structure, such as a single tanh layer.

Figure. 14 Simple RNN Structure

Computational Power of Recurrent Networks
From the point of view of automata theory, all that is relevant is the identification of a set
of internal states which characterize the status of the device at a given moment in time,
together with the specification of rules of operation which predict the next state on the
basis of the current state and the inputs from the environment [42].
Theorem 1: Rational-weighted RNNs having boolean activation functions (simple
thresholds) are equivalent to finite state automata [43].
Proof: Proof shown in [43]
Theorem 2: Rational-weighted RNNs having linear sigmoid activation functions are
equivalent to Turing Machines [44].
Proof: Proof shown in [44]
Theorem 3: Real-weighted RNNs having linear sigmoid activation functions are more
powerful than Turing Machines. Siegelmann and Sontag noted that these networks are
not likely to solve NP-hard problems in polynomial time, as the equality "P=NP" in their
model implies the almost complete collapse of the standard polynomial hierarchy [45].
Proof: Proof shown in [45]
Theorem 4:
All Turing machines may be simulated by fully connected recurrent networks built of
neurons with sigmoidal activation functions [46].
In his model, all neurons synchronously update their states according to a quadratic
combination of past activation values. Proof: Proof shown in [46]


Long-Term Dependencies Problems
What happened to recurrent networks? One major drawback of RNNs is that the range
of contextual information is limited and Backpropagation Through Time (BPTT) [47]
does not store information over long time periods. This shows up as either vanishing or
exploding outputs of the network, known as the vanishing gradient problem and the
exploding gradient problem [48].
These problems arise during training of a deep network when the gradients are being
propagated back in time all the way to the initial layer. The gradients coming from the
deeper layers have to go through continuous matrix multiplications because of the chain
rule, and as they approach the earlier layers, if they have small values (<1), they shrink
exponentially until they vanish and make it impossible for the model to learn; this is the
vanishing gradient problem. On the other hand, if they have large values (>1), they keep
growing and eventually blow up and crash the model; this is the exploding gradient
problem.
Dealing with Exploding Gradients
When gradients explode, they become NaN because of numerical overflow, which results
in irregular oscillations in the training cost when the learning curve is plotted. A solution
is to apply gradient clipping, which places a predefined threshold on the gradients to
prevent them from getting too large; doing so does not change the direction of the
gradients, it only changes their length.
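A minimal sketch of gradient clipping in PyTorch is shown below; the toy model and the threshold value of 5.0 are assumptions of this illustration, not values used in the thesis.

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                            # toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(4, 10), torch.randn(4, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Gradient clipping: rescale the gradients if their global norm exceeds the
# threshold, changing only their length, not their direction.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()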
4.3.1.4 LSTM
What makes LSTM so desirable? For dealing with vanishing gradients, the Long Short-Term
Memory (LSTM) architecture is the most popular and widely used approach. It is
a variant of the RNN which was designed to make it easy to capture long-term
dependencies in sequence data. The standard RNN operates in such a way that the hidden
state activations are influenced by the other local activations closest to them, which
corresponds to a "short-term memory", while the network weights are influenced by the
computations that take place over entire long sequences, which corresponds to a "long-term
memory". Hence the RNN was redesigned so that it has an activation state that can
also act like weights and preserve information over long distances, hence the name "Long
Short-Term Memory" [4].
4.3.1.5 Distributed LSTM
What is the need of distributed machine for LSTM? Recurrent neural networks (RNNs)
have been widely used for processing sequential data. However, RNNs are commonly
difficult to train due to the well-known gradient vanishing and exploding problems and
hard to learn long-term patterns. Long short-term memory (LSTM) and gated recurrent
unit (GRU) were developed to address these problems [49].
The LSTM architectures are usually trained in a batch setting, where all data instances
are present and processed together. However, for applications involving big data, storage
issues may arise from keeping all the data in one place. Additionally, in certain
frameworks all data instances are not available beforehand, since instances are received
in a sequential manner, which precludes batch training. As the data size grows rapidly
every second, in the coming years most big corporations will face computational power
and storage issues due to the large amount of data. As an example, in tweet emotion
recognition applications, the systems are usually trained using an enormous amount of
data to achieve sufficient performance, especially for agglutinative languages [50].
In the common distributed architectures, the whole dataset is distributed to different nodes
but the trained parameters are merged later at a central node. However, this centralized
approach requires high storage capacity and computational power at the central node.
Additionally, centralized strategies have a potential risk of failure at the central node. To
circumvent these issues, we distribute both the processing and the data to all the
nodes and allow communication only between neighboring nodes; hence, we remove the
need for a central node. In particular, each node sequentially receives a variable length of
data sequence with its label and exchanges information only with its neighboring nodes
to train the common LSTM parameters. There are two approaches to achieve this
architecture. With the parameter server framework, a scalable distributed deep
learning approach can be achieved where both data and workloads are distributed over
worker nodes, while the server nodes maintain globally shared parameters, represented as
dense or sparse vectors and matrices (see the toy sketch below). Here the worker nodes
process data and compute local gradients on a mini-batch. They then send push (key,
gradient) messages to the servers, which process the updates asynchronously. When
needed, the workers pull the parameters back with a pull (key) request. A lot of the
infrastructure is borrowed from distributed (key, value) storage such as memcached.
Memcached is a high-performance, distributed memory object caching system, generic in
nature, but originally intended for use in speeding up dynamic web applications by
alleviating database load [51]. The framework manages asynchronous data
communications between nodes and supports flexible consistency models, elastic
scalability and continuous fault tolerance [8].
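The toy, single-process Python sketch below illustrates the push/pull pattern just described; the class and function names (ToyParameterServer, worker_step) and the linear-regression workload are hypothetical, and a real parameter server runs the server and workers on separate nodes.

import numpy as np

class ToyParameterServer:
    """Keeps the globally shared parameters and applies pushed gradients."""
    def __init__(self, shapes, lr=0.01):
        self.params = {k: np.zeros(s) for k, s in shapes.items()}
        self.lr = lr

    def push(self, key, gradient):
        # Workers push (key, gradient); the server applies the update.
        self.params[key] -= self.lr * gradient

    def pull(self, key):
        # Workers pull the latest value of a parameter when needed.
        return self.params[key].copy()

def worker_step(server, minibatch):
    w = server.pull("w")                       # get the current parameters
    x, y = minibatch
    grad = 2 * x.T @ (x @ w - y) / len(y)      # local mini-batch gradient
    server.push("w", grad)                     # send the update back

server = ToyParameterServer({"w": (3, 1)})
rng = np.random.default_rng(0)
for _ in range(100):                           # each loop stands in for a worker
    x = rng.normal(size=(8, 3))
    y = x @ np.array([[1.0], [-2.0], [0.5]])
    worker_step(server, (x, y))
print(server.pull("w"))                        # approaches the true weights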
The other approach is synchronous distributed stochastic gradient descent (SGD), which
is known as distributed synchronous SGD. In practice, each training node performs the
forward-backward pass on different batches sampled from the training dataset with the
same network model. The gradients from all nodes are summed up to optimize their
models. By this synchronous step, models of different nodes are always the same during
the training. The aggregation step can be achieved by performing the all-reduce
operation on the gradients among all nodes and then updating the parameters on each node
independently [52].
4.3.1.5.1 Synchronous All-reduce SGD

In traditional synchronous all-reduce SGD, there are two alternating phases proceeding in
lock-step: (1) each node computes its local parameter gradients, and (2) all nodes
collectively communicate all-to-all to compute an aggregate gradient, as if they all
formed a large distributed minibatch.
The second phase of exchanging gradients forms a barrier and is the communication-intensive
phase, usually implemented by an eponymous all-reduce operation. The time
complexity of an all-reduction can be decomposed into latency-bound and bandwidth-bound
terms. Although the latency term scales with O(log(p)), there are fast ring
algorithms whose bandwidth term is independent of p [52]. With modern networks
capable of handling bandwidth on the order of 1-10 GB/s, combined with neural network
parameter sizes on the order of 10-100 MB, the communication of gradients or
parameters between nodes across a network can be very fast.
Instead, the communication overhead of all-reduce results from its use of a
synchronization barrier, where all nodes must wait for all other nodes until the all-reduce
is complete before proceeding to the next stochastic gradient iteration. This directly leads
to a straggler effect where the slowest nodes prevent the rest of the nodes from making
progress [53].
Algorithm 4. Synchronous all-reduce SGD
Initialize θ0,i ← θ0
for t ∈ {0, ..., T} do
    ∆θt,i ← −αt ∇fi(θt,i ; Xt,i) + µ∆θt-1
    ∆θt ← all-reduce-average(∆θt,i)
    θt+1,i ← θt,i + ∆θt
end for
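A minimal PyTorch sketch of one step of Algorithm 4 with torch.distributed is given below; the helper name allreduce_sgd_step is hypothetical, the gloo backend and toy model are assumptions, and the process-group environment variables (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) are assumed to be provided by a launcher.

import torch
import torch.nn as nn
import torch.distributed as dist

def allreduce_sgd_step(model, optimizer):
    """One iteration of Algorithm 4: average the local gradients across all
    processes with an all-reduce, then apply the same update everywhere."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum over all nodes
        p.grad /= world_size                           # -> average gradient
    optimizer.step()

if __name__ == "__main__":
    dist.init_process_group(backend="gloo")   # env vars set by the launcher
    model = nn.Linear(10, 2)                  # toy model, identical on every node
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    x, y = torch.randn(16, 10), torch.randn(16, 2)
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    allreduce_sgd_step(model, optimizer)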


4.3.2 Baseline LSTM

LSTM is an extension of recurrent neural networks. Due to its special architecture, which
combats the vanishing and exploding gradient problems, it is good at handling time series
problems up to a certain depth. The input gate, the forget gate, and the output gate of LSTM
are designed to control what information should be forgotten, remembered, and updated.

Figure. 15 LSTM Forget Gate
As shown in Figure. 15, first there is a need to forget old information, which involves the
forget gate. In this first step the forget gate looks at ht-1 and xt to compute the output ft,
a number between 0 and 1 for each entry of the cell state. This is multiplied by the cell
state Ct-1, causing the cell to either forget everything (zero) or keep the information (one);
for example, a value of 0.5 means that the cell forgets 50% of its information. It is
considered good practice to initialize this gate to a value of 1, or close to 1, so as to not
impair training at the beginning.


Figure. 16 LSTM Input Gate

As shown in Figure. 16, the next step is to determine what new information to keep in
memory, using the input gate. This has two parts. First, a sigmoid function called the "input
gate" decides which values to update. Next, a tanh function creates a vector of new
candidate values, C̃t, which could be added to the state. From these, it is possible to update
the old cell state to the new cell state. Gating is a method to selectively pass the needed
information.

Figure. 17 LSTM Processing Data

As shown in Figure. 17, the LSTM now updates the old cell state Ct-1 into the new cell
state Ct. It multiplies the old state by ft, forgetting the things it decided to forget earlier.
Then it adds it ∗ C̃t, the new candidate values scaled by how much the LSTM decided to
update each state value.

Figure. 18 LSTM Output Gate
Finally the output value has to be computed, which is done by multiplying ot with the tanh
of the result of the previous step, which yields ht = ot ∗ tanh(Ct), where
ot = σ(Wo [ht-1, xt] + bo). In other words, the output gate decides which information
should be passed to the layer above.
In the LSTM cell, each parameter at moment t can be defined as follows:
ft = σ(Wf [ht-1, xt] + bf)
it = σ(Wi [ht-1, xt] + bi)
C̃t = tanh(Wc [ht-1, xt] + bc)
Ct = ft ∗ Ct-1 + it ∗ C̃t
ot = σ(Wo [ht-1, xt] + bo)
ht = ot ∗ tanh(Ct)
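The NumPy sketch below is a direct transcription of these six equations for a single time step; the sizes, random weights and toy sequence are assumptions of this illustration and are not the thesis's trained parameters.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following the equations above; W and b hold the
    parameters of the forget, input, candidate and output gates."""
    z = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])          # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])          # input gate
    c_hat = np.tanh(W["c"] @ z + b["c"])        # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat            # new cell state
    o_t = sigmoid(W["o"] @ z + b["o"])          # output gate
    h_t = o_t * np.tanh(c_t)                    # new hidden state
    return h_t, c_t

hidden, inputs = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(hidden, hidden + inputs)) for k in "fico"}
b = {k: np.zeros(hidden) for k in "fico"}
h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, inputs)):        # a toy sequence of length 5
    h, c = lstm_step(x_t, h, c, W, b)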

Figure. 19 The unfolded structure of one-layer baseline LSTM
In Figure 19, we define the input set as {x0, x1, …, xt, xt+1, ...}, the output set as
{y0, y1, …, yt, yt+1, ...} and the hidden layers as {h0, h1, …, ht, ht+1, ...}. Then U, W, V
denote the weight matrices from the input layer to the hidden layer, from the hidden layer
to the hidden layer, and from the hidden layer to the output layer, respectively. The
baseline LSTM structure operates along the time axis, from left to right. The transfer
process of the network can be described as follows: the input tensor is transformed, along
with the tensor of the hidden layer at the previous step, to the hidden layer by a matrix
transformation. Then, the output of the hidden layer passes through an activation function
to the final value of the output layer. Formally, the outputs of the hidden layer and output
layer can be defined as follows:
hi = g(U xi + bih)                  where i = 0
hi = g(U xi + W hi-1 + bih)         where i = 1, 2, …
yi = g(V hi + biy)                  where i = 0, 1, ...

4.3.3 Bidirectional LSTM
Baseline LSTM cells predict the current status based only on former information. It is clear
that some important information may not be captured properly by the cell if it runs in only
one direction. Bidirectional LSTMs have been successfully applied for emotion recognition
from low-level frame-wise audio features, which requires modeling of long-range context
along both input directions [52].

Figure. 20 The structure of single layer bidirectional LSTM

As shown in Figure. 20, the bidirectional layer gets information from the vertical direction
(lower layer) and the horizontal direction (past and future) through two separate hidden
layers, and finally outputs the processed information to the upper layer. There are forward
sequences h→ from left to right (green arrows) and backward sequences h← from right to
left (red arrows) in the hidden layer. For a moment t (0, 1, 2, ...) the hidden layer and the
output layer can be defined as follows:
h→t = g(Uh→ xt + Wh→ h→t-1 + bh→)
h←t = g(Uh← xt + Wh← h←t+1 + bh←)
yt = g(Vh→ h→t + Vh← h←t + by)
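The PyTorch sketch below shows a single bidirectional LSTM layer feeding a classifier; the 9 input channels, 128-step window and 6 classes mirror the HAR setup, but the hidden size and the use of only the last time step are assumptions of this illustration.

import torch
import torch.nn as nn

# Bidirectional LSTM layer followed by a classifier over the last time step.
bdlstm = nn.LSTM(input_size=9, hidden_size=32, num_layers=1,
                 batch_first=True, bidirectional=True)
classifier = nn.Linear(2 * 32, 6)       # forward + backward states -> 6 classes

x = torch.randn(16, 128, 9)             # (batch, time, features)
out, _ = bdlstm(x)                      # out: (batch, time, 2 * hidden)
logits = classifier(out[:, -1, :])      # use only the last time step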

4.3.4 Residual LSTM
The Microsoft Research Asia (MSRA) team built a 152-layer network: on the ImageNet
dataset the team evaluated residual nets with a depth of up to 152 layers, 8x deeper than
VGG nets [54] but still of lower complexity. This result won 1st place in the ImageNet
Large Scale Visual Recognition Challenge (ILSVRC) 2015 classification task.
The depth of representations is of central importance for many visual recognition tasks.
A residual network [54] provides an identity mapping by shortcut paths. Since the identity
mapping is always on, the function only needs to learn the residual mapping. Formulation
of this relation can be expressed as:
y = F(x; W) + x
where y is the output layer, x is the input layer and F(x; W) is a function with internal
parameters W. Without a shortcut path, F(x; W) would have to represent y from the input
x, but with the identity mapping x, F(x; W) only needs to learn the residual mapping
y − x. As layers are stacked up, if no new residual mapping is needed, the network can
bypass them via identity mappings without training, which greatly simplifies the training
of a deep network.


As the network deepens, the research emphasis shifts to how to overcome the obstruction
of information and gradient transmission. MSRA uses residual networks with the main
idea that it is easier to optimize the residual mapping than to optimize the original,
unreferenced mapping. An important advantage of residual networks is that they are much
easier to train, because the gradients can be passed through the layers more directly with
the addition operator, which enables them to bypass some layers that would have otherwise
been restrictive. This enables both better training and a deeper network, because residual
connections do not impede gradients and still contribute to refining the output of a highway
layer composed of such residual connections [55].
Skip connections made the training of very deep networks possible and have become an
indispensable component in a variety of neural architectures. The difficulty of training deep
networks is partly due to the singularities caused by the non-identifiability of the model.
Several such singularities have been identified in previous works: (1) overlap singularities
caused by the permutation symmetry of nodes in a given layer, (2) elimination singularities
corresponding to the elimination, i.e. consistent deactivation, of nodes, (3) singularities
generated by the linear dependence of the nodes. These singularities cause degenerate
manifolds in the loss landscape that slow down learning. We argue that skip connections
eliminate these singularities by breaking the permutation symmetry of nodes, by reducing
the possibility of node elimination and by making the nodes less linearly dependent.
Moreover, for typical initializations, skip connections move the network away from the
“ghosts” of these singularities and sculpt the landscape around them to alleviate the
learning slow-down. These hypotheses are supported by evidence from simplified models,
as well as from experiments with deep networks trained on real-world datasets [56].


Figure. 21 The structure of single layer residual LSTM
The lower information can be transmitted to the upper layer directly through a highway,
which increases the freedom of the information flow. The highway structure containing
skip connections can connect many supplementary n (n = 0, 1, 2, …) layers in height
before the bottleneck. When n equals 0, there is no residual connection and it becomes
like the baseline deep-stacked LSTM layers. The output of hidden layer i (i = 1, 2, ..., L)
can be defined as follows:
h1 = σ(W1 x + b1)                    where i = 1
hi = σ(Wi hi-1 + bi) + hi-1          where i = 2, 3, …, L-1
y  = σ(Wy hL-1 + by) + hL-1          where i = L

During the code implementation, indexing in the configuration file starts at one rather than
zero because we included the count of the first layer that acts as a basis before the residual
cells. The same counting rule applies for indicating how many blocks of residual highway
layers are stacked one on top of the other.
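A minimal PyTorch sketch of the identity-shortcut idea y = F(x; W) + x wrapped around an LSTM layer is shown below; the class name ResidualLSTMBlock and the tensor sizes are assumptions of this illustration, not the thesis's implementation.

import torch
import torch.nn as nn

class ResidualLSTMBlock(nn.Module):
    """An LSTM layer whose output is added to its input (y = F(x; W) + x).
    The input and output sizes must match for the identity shortcut."""
    def __init__(self, size):
        super().__init__()
        self.lstm = nn.LSTM(size, size, batch_first=True)

    def forward(self, x):
        out, _ = self.lstm(x)
        return out + x                  # residual (shortcut) connection

block = ResidualLSTMBlock(size=32)
x = torch.randn(8, 128, 32)             # (batch, time, features)
y = block(x)                            # same shape as x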

4.3.5 Deep Residual Bidirectional LSTM
The deep bidirectional LSTM (BDLSTM) architectures are networks with several
bidirectional stacked LSTM hidden layers, in which the output of an LSTM hidden layer
is fed as the input into the subsequent LSTM hidden layer. This stacked-layer mechanism
enhances the power of neural networks [57]. Previous research [58] has shown that the
BDLSTM takes spatial time series data as the input and predicts future speed values for
one time step. The BDLSTM is also capable of predicting values for multiple future time
steps based on historical data. When feeding the spatial-temporal information of the
traffic network to the BDLSTMs, both the spatial correlation of the speeds at different
locations of the traffic network and the temporal dependencies of the speed values can be
captured during the feature learning process. In this regard, the BDLSTMs are very
suitable as the first layer of a model, to learn more useful information from spatial time
series data. When predicting future speed values, the top layer of the architecture only
needs to utilize the learned features, namely the outputs from the lower layers, calculate
iteratively along the forward direction and generate the predicted values. But as the
complexity and volume of the data grow, the model may not work due to the obstruction
of information and gradient transmission discussed in the residual LSTM section. In
general, gradient vanishing is a widespread problem for deep networks. There is therefore
a need for a hybrid LSTM model which works in those cases. The residual, bidirectional,
and stacked layers (hence the name "Deep Residual Bidirectional LSTM" (RBDLSTM))
[59] help counter this problem, because some bottom layers would otherwise be too hard
to optimize when using backpropagation.

The RBDLSTM layer contains a BDLSTM layer as the first feature-learning layer and an
LSTM layer as the last layer. For the sake of making full use of the input data and learning
complex and comprehensive features, the RBDLSTM includes one or more middle
BDLSTM layers along with residual LSTM layers. These architectures can take the form
of 2 x 2, 3 x 3 or 4 x 4 layers, depending on the complexity of the problem and the
learning rate, where there are n residual layers each containing n bidirectional hidden
layers. Combined with batch normalization on top of each residual layer, residual
connections act as shortcuts for gradients. This prevents restrictions in the hidden layer
feature space from becoming too complex and avoids outlier values at test time, which
guards against overfitting.
In Figure 22, the information flows in a bidirectional fashion in the horizontal direction
(temporal dimension) and in a unidirectional fashion in the vertical direction (depth
dimension). With the exception of the input and output layers, there are 2 hidden layers
which have residual connections inside (hence called "residual layers"). Moreover, each
residual layer contains 2 bidirectional layers. The network in Figure. 22 demonstrates the
2 x 2 architecture, which can also be thought of as 8 LSTM cells in total working as one
network. In our network, the activation function is ReLU, because it consistently performs
better in deep networks and counters gradient vanishing. Although the output is a tensor
for a given time window, the time axis has been crunched by the neural network. That is,
we need only the last element of the output and can discard the others. Thus, only the
gradient from the prediction at the last time step is applied. This also makes one LSTM
cell unnecessary: the uppermost backward LSTM in the bidirectional pass. Hopefully, this
is not of great concern because TensorFlow should evaluate what to compute and what
not to compute. Additionally, the training dataset should be shuffled during the training
process. The state of the neural network is reset at each new window for each new
prediction.

Figure. 22 The structure of 2 x 2 residual bidirectional LSTM

Figure. 23 The residual bidirectional LSTM components
The residual bidirectional LSTM is a hybrid of all the above layers; the outputs of the
hidden layers and the output layer are given in series as follows:
Stacked LSTM without residual connections:
Let LSTMi and LSTMi+1 be the ith and (i+1)th LSTM layers in a stack, whose parameters
are Wi and Wi+1 respectively. At the tth time step, for the stacked LSTM without residual
connections, we have:
cti , mti = LSTMi (ct-1i , mt-1i , xti-1 ; Wi )
xti = mti ,
cti+1 ,mti+1 = LSTMi+1(ct-1i+1 , mt-1i+1 , xti ; Wi+1 )
where xti is the input to LSTMi at time step t, and mti and cti are the hidden states and
memory states of LSTMi at time step t, respectively.


Stacked LSTM with residual connections:
With residual connections between LSTMi and LSTMi+1, the above equations become:
cti , mti = LSTMi (ct-1i , mt-1i , xti-1 ; Wi )
xti = mti + xti-1 ,
cti+1 ,mti+1 = LSTMi+1(ct-1i+1 , mt-1i+1 , xti ; Wi+1 )
Residual connections greatly improve the gradient flow in the backward pass, which allows
us to train very deep networks.
Stacked LSTM with residual bidirectional connections:
In an LSTM stack with residual connections there are two accumulators: cti along the time
axis and xti along the depth axis. In theory, both of the accumulators are unbounded, but in
practice, we noticed their values remain quite small. For inference, we explicitly constrain
the values of these accumulators to be within [-δ, δ] to guarantee a certain range that can
be used for calculation purpose later. The forward computation of an LSTM stack with
residual connections is modified to the following:
ct'i , mti = LSTMi (ct-1i , mt-1i , xti-1 ; Wi )
cti = max(−δ, min(δ, ct'i ))
xt'i = mti + xti-1
xti = max(−δ, min(δ, xt'i ))
ct'i+1 , mti+1 = LSTMi+1 (ct-1i+1 , mt-1i+1 , xti ; Wi+1 )
cti+1 = max(−δ, min(δ, ct'i+1 ))
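The PyTorch sketch below implements one time step of this clipped, residual forward computation with a stack of LSTM cells; the helper name clipped_residual_stack, the value δ = 10.0 and the tensor sizes are assumptions of this illustration, not the thesis's code.

import torch
import torch.nn as nn

def clipped_residual_stack(cells, x_t, states, delta=10.0):
    """One time step through a stack of LSTM cells with residual connections
    along depth; both accumulators are clipped to [-delta, delta].
    `states` is a list of (m, c) pairs, one per layer."""
    new_states = []
    for i, cell in enumerate(cells):
        m_prev, c_prev = states[i]
        m, c = cell(x_t, (m_prev, c_prev))          # LSTM_i step
        c = torch.clamp(c, -delta, delta)           # clip the cell accumulator
        x_t = torch.clamp(m + x_t, -delta, delta)   # residual + clip along depth
        new_states.append((m, c))
    return x_t, new_states

size, batch = 32, 4
cells = nn.ModuleList([nn.LSTMCell(size, size) for _ in range(3)])
states = [(torch.zeros(batch, size), torch.zeros(batch, size)) for _ in cells]
x_t = torch.randn(batch, size)                      # input at one time step
out, states = clipped_residual_stack(cells, x_t, states)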
The network can be quantized further with effective quantization methods that reduce the
bit-widths of the weights, activations and gradients, which shrinks its storage size and
memory usage and also allows for faster training and inference by exploiting bitwise
operations [60]. This area is not researched in this thesis.


CHAPTER V
TESTBED SETUP
In this section, we set up the hardware for this research and run simulation programs to
verify that the platform is ready for the research. This introduces distributed machine
learning, where we need a cluster of machines that are either connected physically with
each other or connected over the network. Here we built a Raspberry Pi cluster which
consists of 16 Raspberry Pi 3 B+ models connected together by a switch hub, where the
switch is connected to the LAN of the research lab. The next platform we built is an
NVIDIA GPU cluster which consists of 3 GPUs, a Tesla K40c, a Quadro P5000 and a
Quadro K620, on top of a multicore CPU with 32 GB RAM and a 2 TB SSD with 10 TB
of HDD space. We present the details of the setup along with simulation results in the
sections below.

5.1 NVIDIA GPU Test Bed Setup
We use the Ubuntu 18.04 LTS 64-bit version for our development environment to perform
the experiments. We use one multicore CPU machine with enough memory and disk
space to support the 3 GPUs, a Tesla K40c, a Quadro P5000 and a Quadro K620. In this
experiment we have used the NVIDIA Maximus formation, combining the computational
power of the NVIDIA Tesla GPU and the visualization power of the NVIDIA Quadro
GPUs. This is the most efficient formation recommended by NVIDIA for deep learning
performance. The cluster of these GPUs is connected by a 100 Mbps LAN. The hardware
configuration of the machine along with the NVIDIA GPUs is listed in Tables 2 and 3,
and the detailed workstation setup is described in APPENDIX A.

Processor: Dual Intel Xeon E5-2609 v4, 8-Core, 1.7 GHz, 20MB L3 Cache, 85 Watts
Memory: 32GB DDR4-2400MHz (4 x 8GB)
Motherboard: Asus Z10PE-D16 WS Intel Xeon
Power Supply: 750 Watt EVGA SuperNOVA, 80Plus Bronze Certified
Hard Drive 1: 2TB Samsung 960 Pro PCIe 3.0 SSD
Hard Drive 2: 2 x 4 TB 7200rpm SATA 600 with 64MB Cache
GPUs: Tesla K40c, Quadro P5000, Quadro K620
Table. 2 Hardware configuration of CPU
Device 0: Quadro P5000
  CUDA Driver Version / Runtime Version: 10.0 / 10.0
  CUDA Computation Capability Version: 6.1
  Total amount of global memory: 16279 MBytes (17069309952 bytes)
  Total CUDA Cores: 2560
  Multiprocessors: (20) Multiprocessors, (128) CUDA Cores
  GPU Max Clock rate: 1734 MHz (1.73 GHz)
  Memory Clock rate: 4513 MHz
  Memory Bus Width: 256-bit
  Maximum Texture Dimension Size (x,y,z): 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)

Device 1: Tesla K40c
  CUDA Driver Version / Runtime Version: 10.0 / 10.0
  Total amount of global memory: 11441 MBytes (11996954624 bytes)
  Total CUDA Cores: 2880
  Multiprocessors: (15) Multiprocessors, (192) CUDA Cores
  GPU Max Clock rate: 745 MHz (0.75 GHz)
  Memory Clock rate: 3004 MHz
  Memory Bus Width: 384-bit
  Maximum Texture Dimension Size (x,y,z): 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)

Device 2: Quadro K620
  CUDA Driver Version / Runtime Version: 10.0 / 10.0
  Total amount of global memory: 2000 MBytes (2096955392 bytes)
  Total CUDA Cores: 384
  Multiprocessors: (3) Multiprocessors, (128) CUDA Cores
  GPU Max Clock rate: 1124 MHz (1.12 GHz)
  Memory Clock rate: 900 MHz
  Memory Bus Width: 128-bit
  Maximum Texture Dimension Size (x,y,z): 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)

Warp size (same for all): 32

Table. 3 Hardware configuration of GPUs

As shown in Table. 3, the machine is installed with CUDA 10.0.130 along with a
compatible cuDNN 7.5 for TensorFlow and PyTorch.
The detailed process of installing CUDA and cuDNN is given in APPENDIX A. Here we
describe the verification process after installation.
Recommended Actions for Installation Verifications
1. Check the .bashrc after reboot.
2. Verify the installed driver version. If the driver is installed correctly, its version will be
shown by the below command.
$ cat /proc/driver/nvidia/version

Figure.24 NVIDIA Driver Version
3. Verify the CUDA Toolkit Version by the below command.
$ nvcc -V


Figure.25 CUDA Toolkit Version
4. Compile the CUDA examples.
In order to modify, compile and run the samples, they must be installed with write
permission. Please run the script which is already available in the CUDA installation
directory:
cuda-install-samples-10.0.sh ~
which copies the samples to the home directory. Once the copying is finished, run the
below commands to compile and run a sample.
cd ~/NVIDIA_CUDA-10.0_Samples/5_Simulations/nbody
make
./nbody
5. Run the binaries.
After compilation, run deviceQuery under the Samples folder with the below command.
./deviceQuery
If the CUDA software is installed and configured correctly, the output of deviceQuery
will show a PASS statement as shown below.
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime
Version = 10.0, NumDevs = 3
Result = PASS
Run the peer-to-peer bandwidth/latency test for verification with the below command.
./p2pBandwidthLatencyTest

If the CUDA software is able to connect to the other GPU drivers, the below matrix will
appear in the results, which validates the successful installation of CUDA.

P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU      0       1       2
     0    1.27   15.42   14.41
     1   14.54    4.28   16.47
     2   13.74   16.19    3.75
   CPU      0       1       2
     0    6.07   14.02   13.95
     1   14.14    5.99   13.97
     2   13.95   13.80    6.10
Test passed!

6. Verify the cuDNN installation after successfully installing cuDNN.
To verify that cuDNN is running properly, compile the mnistCUDNN sample located in
the /usr/src/cudnn_samples_v7 directory of the Debian package installation.
Steps:
1. Copy the cuDNN sample to a writable path.
$ cp -r /usr/src/cudnn_samples_v7/ $HOME
2. Go to the writable path.
$ cd $HOME/cudnn_samples_v7/mnistCUDNN
3. Compile the mnistCUDNN sample.
$ make clean && make
4. If you face any issues, open the file /usr/include/cudnn.h, change the include below and
save it:
#include "driver_types.h" → #include <driver_types.h>
5. Run the mnistCUDNN sample.
$ ./mnistCUDNN
6. If cuDNN is properly installed and running on your Linux machine, you will see a
message similar to the following.
Test passed!
The complete test result files are provided in APPENDIX A.
After the above installation the machine is able to run deep learning models, but the
clustering still has to be set up.
There are basically two options for multi-GPU programming. The first option is to do it
in CUDA with a single thread and manage the GPUs directly, by setting the current
device and by declaring and assigning a dedicated memory stream to each GPU. The
other option is to use CUDA-aware MPI, where a single thread is spawned for each GPU
and all communication and synchronization is handled by MPI.
We have chosen to go with the first option, where the clustering is done based on the
cudaSetDevice query.


Figure.26 GPU Memory Array
The result of the NVIDIA Maximus cluster formation is shown below.
< multiple host threads can use ::cudaSetDevice() with device simultaneously >
> Peer access from Quadro P5000 (GPU0) -> Tesla K40c (GPU1) : Yes
> Peer access from Quadro P5000 (GPU0) -> Quadro K620 (GPU2) : Yes
> Peer access from Tesla K40c (GPU1) -> Quadro P5000 (GPU0) : Yes
> Peer access from Tesla K40c (GPU1) -> Quadro K620 (GPU2) : Yes
> Peer access from Quadro K620 (GPU2) -> Quadro P5000 (GPU0) : Yes
> Peer access from Quadro K620 (GPU2) -> Tesla K40c (GPU1) : Yes
We have implemented the synchronous all-reduce approach, which is the default behavior
of the distributed TensorFlow MirroredStrategy API. Both of our implementations use the
all-reduce approach; however, they can easily be extended to other approaches. Here the 3
GPUs work as actors that mirror the task, which is taken care of by TensorFlow, as shown
in Figure. 27.


Figure.27 Distributed TensorFlow API

5.2 Cluster of Raspberry Pis Setup
We have used the Raspberry Pi 3 Model B+ in this experiment. There are good reasons to
use the Raspberry Pi in this research.
The Raspberry Pi board comprises program memory (RAM), a processor, a graphics chip,
CPU, GPU, an Ethernet port, GPIO pins, an Xbee socket, a UART, a power source
connector and various other interfaces for external devices. We have added a 32 GB flash
memory SD card, used as storage in each Pi, so that the Raspberry Pi board boots from
this SD card just as a PC boots into Windows from its hard disk. This tiny computer,
having all these qualities at a very cost-effective price, is a perfect candidate for building
large clusters for research purposes.
The hardware configuration of the Raspberry Pi is listed in Table. 4.
There are 16 Pis used to make this cluster. The Raspbian Stretch kernel version 4.14 is
installed on each Pi. TensorFlow 1.8.0 is installed on each Pi from a prebuilt .whl file. To
create the shared folder between the Pis, the NFS server and client model is used, as it is
supported natively on Linux.

Processor: Broadcom BCM2837B0, Cortex-A53, 64-bit SoC @ 1.4GHz
Memory: 1GB LPDDR2 SDRAM
Hard Drive 1: Samsung 32 GB Flash Drive
Power Supply: 5V/2.5A DC via micro USB connector
Integrated Wi-Fi: 2.4GHz and 5GHz
Ethernet speed: 300Mbps
Table. 4 Hardware configuration of Raspberry Pi 3 Model B+
The details of creating the shared server and client structure are described below.
Step-1:
Create the NFS server on one of the Pis, which is known as the master Pi. Before setting
up NFS there are some prerequisites which are good to follow.
Log into the Pi configuration management tool with the below command.
sudo raspi-config
1. Update the Pi software.
2. Rename each Pi from the default name to rpi# as per the nodes going to be used in the
cluster. You can do that from the configuration tool itself and restart the Pi.
3. Enable the hostname on each Pi.
4. Change the default password to your own convenient one.
5. Change the assigned memory for the GPU to the minimum.
6. Change the assigned memory for CPU to maximum.
7. In raspi-config, change (3. Boot Options > B2 Wait for Network at Boot) from “No” to
“Yes”. This will ensure that networking is available before the fstab file mounts the NFS
client.
8. In raspi-config, enable SSH.
Step-2: (NFS Server)
Install the NFS server on the master node with the below commands.
sudo apt-get install nfs-common nfs-server -y
sudo mkdir /home/pi/Desktop/nfsserver
sudo chmod -R 777 /home/pi/Desktop/nfsserver
This creates a server folder named nfsserver on the master Pi.
Step-3:
Validate the NFS version with the below command.
rpcinfo -u localhost nfs
Step-4:
Add the nfsserver folder to the NFS exports on the master so that when the other Pis add
something to the folder it is automatically updated by the server. Please use the below
export entry.
/home/pi/Desktop/nfsserver 192.168.1.1/26(rw,sync,no_subtree_check)
where the nfsserver folder is exported read/write, synchronously, with no subtree checking.
Step-5: (NFS Client)
Install the NFS client on each Raspberry Pi node so that the nodes can communicate with
each other through the RPC protocol which NFS uses internally.
1. Install the NFS client with the command.
sudo apt-get install nfs-common -y
2. Make a client directory on the Pi.
sudo mkdir -p /home/pi/Desktop/nfs
3. Give permission to the directory.
sudo chown -R pi:pi /home/pi/Desktop/nfs
4. Mount the directory to the NFS server and add it to /etc/fstab so that it is mounted at boot.
sudo mount 192.168.1.26:/home/pi/Desktop/nfsserver /home/pi/Desktop/nfs
sudo nano /etc/fstab
192.168.1.26:/home/pi/Desktop/nfsserver /home/pi/Desktop/nfs nfs rw 0 0
5. Verify the mount.
nfsstat -m
Step-6:
Restart the NFS client service so that the server will recognize the client.
sudo /etc/init.d/nfs-common restart
Step-7:
Once the NFS client is installed on all the Raspberry Pis, restart the NFS server to verify
that it is connecting to all the clients.
sudo /etc/init.d/nfs-kernel-server restart
The below snapshot shows the state after the NFS client-server model is successfully
installed on all the machines.


Figure.28 Raspberry Pis NFS connection

The directory is always available to all the Raspberry Pi workers as well as to the master,
which runs both the client and the server; the status of NFS is shown below in the snapshot.

Figure.29 NFS Status

Issues on NFS Setup:
By default the NFS server only allows 15 Raspberry Pi client nodes to connect to the
server. This is the default NFS property. To increase this limit please follow the below
steps.
Step-1:
Open the nfs-kernel-server defaults file and change the thread count to a larger number as
required.
sudo nano /etc/default/nfs-kernel-server
RPCNFSDCOUNT = 16
RPCMOUNTDOPTS= " --manage-gids --no-nfs-version 3"
Step-2:
Change the following things in nfs-utils file.
sudo nano /run/sysconfig/nfs-utils
RPCNFSDARGS = “16”
Step-3:
Create a file named "sunrpc.conf" in the below location and add the details provided
below.
1. Go to the directory: /etc/modprobe.d
2. Create a file named: sunrpc.conf
3. Add the following contents to the above file to allow the number of clients in the NFS server:
options sunrpc tcp_slot_table_entries=128
options sunrpc tcp_max_slot_table_entries=128

5.3 Simulation using Raspberry Pis Cluster
The simulation is done with the 16-node Raspberry Pi cluster, where 15 nodes work as
worker nodes and one works as the master node as well as a worker node. In this cluster
each task is associated with a server. The simulation is a Monte Carlo simulation which
uses the 16-node Raspberry Pi cluster in a distributed TensorFlow environment to
estimate the value of pi.
The program has two parts: one is the server program server.py, which runs on the NFS
server where each client has access, and the other is the client part, client.py, which
calculates the value of pi using the Monte Carlo method. The program generates random
points between (-1, -1) and (1, 1), i.e. over a circle of radius 1 inscribed in a square.
The source code of the program is given in APPENDIX-E; a standalone sketch of the
estimate itself is shown below.
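The Python sketch below shows only the Monte Carlo estimate, not the distributed server/client program of APPENDIX-E; the use of NumPy and the fixed seed are assumptions of this illustration.

import numpy as np

def estimate_pi(sample_size, seed=0):
    """Draw points uniformly in the square [-1, 1] x [-1, 1] and count the
    fraction that falls inside the unit circle; that fraction times 4 -> pi."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, sample_size)
    y = rng.uniform(-1.0, 1.0, sample_size)
    inside = np.count_nonzero(x * x + y * y <= 1.0)
    return 4.0 * inside / sample_size

print(estimate_pi(10_000_000))   # approaches 3.14159... as the sample grows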
Distributed TensorFlow works a bit like a server-client model. The idea is that you create
a whole bunch of workers that will perform the heavy lifting. You then create a session
on one of those workers, and it will compute the graph, possibly distributing parts of it to
the other workers in the cluster. In order to do this, the main worker, or master, needs to
know about the other workers. This is done via the creation of a ClusterSpec as shown in
Figure. 30, which you need to pass to all workers. A ClusterSpec is built using a
dictionary, where the key is a "job name", and each job contains many workers.
The code shown is taken from the simulation program, where each Raspberry Pi node IP
is listed in the taskList and a working cluster is created with the API
tf.train.ClusterSpec, where each job is specified as a sparse mapping from task indices to
network addresses.


Figure.30 TensorFlow Cluster API
In Figure. 31, the TensorFlow API tf.device is used to create a device context such that
all the operations within that context have the same device assignment, instead of the
program automatically selecting available devices to participate in the computation. It
allows the user to pin an operation to a user-specified device.

Figure. 31 TensorFlow Device API


TensorFlow uses a dataflow graph to represent your computation in terms of the
dependencies between individual operations. This leads to a low-level programming
model in which you first define the dataflow graph, then create a TensorFlow session to
run parts of the graph across a set of local and remote devices. As shown in Figure. 32,
TensorFlow uses tf.Session API to create a session object which encapsulates the
environment in which Operation objects are executed, and Tensor objects are evaluated.
In simple terms, the session allocates memory to store the current values of the variables.
tf.global_variables_initializer() initializes all the TensorFlow variables before they are
used in the operations.

Figure.32 TensorFlow Session API
In Figure. 33, the TensorFlow API tf.train.Server() is used as an in-process TensorFlow
server, for use in distributed training. A tf.train.Server instance encapsulates a set of
devices and a tf.Session target that can participate in distributed training. A server
belongs to a cluster (specified by a tf.train.ClusterSpec), and corresponds to a particular
task in a named job. The server can communicate with any other server in the same
cluster.

Figure. 33 TensorFlow Server API
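The TensorFlow 1.x sketch below combines the pieces described around Figures 30-33 (ClusterSpec, tf.device, tf.Session and tf.train.Server); the worker IP addresses are placeholders, not the lab's actual task list, and the toy constants stand in for the simulation graph.

import tensorflow as tf   # TensorFlow 1.x style, as used in this thesis

# Figure 30-style ClusterSpec: a job name mapped to the worker addresses.
cluster = tf.train.ClusterSpec({
    "worker": ["192.168.1.101:2222", "192.168.1.102:2222"]
})

# Figure 33-style in-process server for this task.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Figure 31-style explicit device placement on a chosen task.
with tf.device("/job:worker/task:0"):
    a = tf.constant(3.0)
    b = tf.constant(4.0)
    total = a + b

# Figure 32-style session that talks to the master service of this server.
with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(total))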


Result
Execution time (seconds) by sample size and cluster size:
Sample Size  | 1 node | 2 nodes | 4 nodes | 8 nodes | 16 nodes
10,000,000   | 1.495  | 1.18    | 1.32    | 1.961   | 2.34
20,000,000   | 2.594  | 1.704   | 1.572   | 2.087   | 2.48
30,000,000   | 3.673  | 2.284   | 1.848   | 2.299   | 2.57
40,000,000   | 4.758  | 2.818   | 2.09    | 2.35    | 2.63
50,000,000   | 6.561  | 3.373   | 2.391   | 2.48    | 2.79
60,000,000   | 7.713  | 3.91    | 2.648   | 2.628   | 2.593
70,000,000   | N/A    | 4.451   | 2.923   | 2.782   | 2.421
80,000,000   | N/A    | 4.998   | 3.218   | 2.953   | 2.561
90,000,000   | N/A    | 5.678   | 3.482   | 3.147   | 2.842
100,000,000  | N/A    | 6.103   | 3.741   | 3.234   | 2.611
Table. 5 Raspberry Pi Cluster Monte Carlo Simulation
As the cluster size increases, the program is computing faster for larger sample sizes but
slower for smaller sample sizes. For example, for sample size 100 million, the size 8
cluster is faster than the size 2 cluster (3.234s vs 6.103s). However, for sample size 10
million, the size 2 cluster is faster than the size 8 cluster (1.180s vs. 1.961s). The slowdown
for smaller sample sizes may be due to the overhead of the tasks communicating with each
other.


Figure. 34 Pi Cluster Execution Graph

5.4 Notes on TensorFlow Setup
The simulation program using TensorFlow is built on the distributed TensorFlow APIs,
where the GPU cluster is used. We are going to discuss the important APIs used in this
research work.

Figure. 35 TensorFlow GPU Growth API

By default, TensorFlow requests nearly all of the GPU memory of all GPUs to avoid
memory fragmentation (since the GPU has much less memory, it is more vulnerable to
fragmentation). To avoid this issue, we have used the option
config.gpu_options.allow_growth = True as shown in Figure. 35, so that TensorFlow
can grow its memory usage gradually when desired. In Figure. 13 it is observed that while
computing the 4 x 4 stacked residual bidirectional layer for the dataset with 256 hidden
units, which is the most complex iteration in our experiment, it uses only one part of the
GPU memory.
In Figure. 35 we have also used the option allow_soft_placement = True, which lets
TensorFlow automatically choose an existing and supported device to run the operations
in case the specified one doesn't exist; we set allow_soft_placement to True in the
configuration options when creating the session. With this option, our program can run on
machines without GPU clusters without giving any errors.
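A minimal TensorFlow 1.x sketch of this session configuration is shown below; the toy constant op is only there to make the snippet self-contained.

import tensorflow as tf   # TensorFlow 1.x style

config = tf.ConfigProto(
    allow_soft_placement=True,      # fall back to a supported device if the
                                    # requested one does not exist
    log_device_placement=False)
config.gpu_options.allow_growth = True   # grow GPU memory as needed instead of
                                         # reserving nearly all of it up front

with tf.Session(config=config) as sess:
    x = tf.constant([1.0, 2.0, 3.0])
    print(sess.run(tf.reduce_sum(x)))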
StreamExecutor is a unified wrapper around the CUDA and OpenCL host-side
programming models (runtimes). It lets host code target either CUDA or OpenCL devices
with identically-functioning data-parallel kernels. StreamExecutor is currently used as the
runtime for the vast majority of Google's internal GPGPU applications, and a snapshot of
it is included in the open-source TensorFlow project, where it serves as the GPGPU
runtime. As shown in Figure. 36 and 37, it inspects the capabilities of a GPU-like device
at runtime and manages multiple devices.

Figure. 36 GPU StreamExecutor

Figure. 37 GPU Device Selection

Figure. 38 Distributed TF Multi GPU
As shown in Figure. 38, for distributed TensorFlow we have used the API,
tf.contrib.distribute.MirroredStrategy in our program. This strategy uses one replica
per device and sync replication for its multi-GPU version. When cluster_spec is given by
the configure method, it turns into the multi-worker version that works on multiple
workers with in-graph replication. Note: configure will be called by higher-level APIs if
running in distributed environment.
In-graph replication: the client creates a single tf.Graph that specifies tasks for devices
on all workers. The client then creates a client session which talks to the master service of
a worker. Then the master will partition the graph and distribute the work to all
participating workers.
Worker: A worker is a TensorFlow task that usually maps to one physical machine. We
will have multiple workers with different task index. They all do similar things except for
one worker checkpointing model variables, writing summaries, etc. in addition to its
ordinary work.
The multi-worker version of this class maps one replica to one device on a worker. It
mirrors all model variables on all replicas. For example, in our program we have two
workers and each worker has a single GPU, so it creates 2 copies of the model variables
on these 2 GPUs. Then, as in MirroredStrategy, each replica performs its computation
with its own copy of the variables, except in cross-replica mode, where variable or tensor
reduction happens.
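The short sketch below shows how such a strategy is created and handed to the training configuration in the TensorFlow 1.x contrib era; the exact argument names varied between releases, and my_model_fn is a hypothetical placeholder rather than the thesis's model function.

import tensorflow as tf   # TensorFlow 1.x, tf.contrib era

# One model replica per GPU; gradients are aggregated with all-reduce, so
# every replica always holds the same (mirrored) variables.
strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=3)

# The strategy is handed to the training configuration; with the Estimator API
# this was done through RunConfig (argument names varied by release).
run_config = tf.estimator.RunConfig(train_distribute=strategy)
# estimator = tf.estimator.Estimator(model_fn=my_model_fn, config=run_config)
# (my_model_fn is a placeholder for the user's model-building function.)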


5.5 Notes on using PyTorch Setup
In the PyTorch program we have used the torch.distributed package, imported as dist.
PyTorch distributed currently only supports Linux. By default, the Gloo and NCCL
backends are built and included in PyTorch distributed (NCCL only when building with
CUDA). As a rule of thumb, we use the NCCL backend for distributed GPU training with
CUDA.

Figure. 39 PyTorch Distributed API
In Figure. 39, the torch.distributed package provides PyTorch support and
communication primitives for multiprocess parallelism across several computation nodes
running on one or more machines. The class
torch.nn.parallel.DistributedDataParallel() builds on this functionality to provide
synchronous distributed training as a wrapper around any PyTorch model. It differs
from the kinds of parallelism provided by the multiprocessing package
(torch.multiprocessing) and by torch.nn.DataParallel() in that it supports multiple
network-connected machines and in that the user must explicitly launch a separate copy
of the main training script for each process.
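A minimal sketch of this initialization; the environment variables are normally provided by the launcher that starts one copy of the script per process, and the plain LSTM module is a stand-in for our HAR model:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# Normally set by the launcher (one process per GPU); set here so the sketch runs standalone.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

dist.init_process_group(backend="nccl", init_method="env://")   # NCCL backend for GPU training

model = torch.nn.LSTM(input_size=9, hidden_size=32).cuda()      # stand-in for the HAR LSTM model
model = DistributedDataParallel(model)                          # synchronous gradient averaging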

Figure. 40 PyTorch Memory Shuffle

The data-loading task is distributed over 8 workers, and pin_memory is set to true so that
batches loaded on the CPU are pushed to the GPU during training more quickly: enabling
pin_memory lets the DataLoader allocate the samples in page-locked memory, which
speeds up the host-to-device transfer.
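A minimal sketch of this loader configuration; the random tensors stand in for the HAR windows (128 timesteps of 9 signals) and the batch size is an illustrative assumption:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset: 1000 windows of 128 timesteps x 9 signals, 6 activity classes.
signals = torch.randn(1000, 128, 9)
labels = torch.randint(0, 6, (1000,))
train_dataset = TensorDataset(signals, labels)

# 8 worker processes load batches in parallel; pin_memory places them in page-locked
# host memory so the host-to-device copy is faster.
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True,
                          num_workers=8, pin_memory=True)

for batch_signals, batch_labels in train_loader:
    batch_signals = batch_signals.cuda(non_blocking=True)
    batch_labels = batch_labels.cuda(non_blocking=True)
    # ... forward/backward pass of the model ...
    break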
As our hardware is the single-machine synchronous case, torch.distributed and the
torch.nn.parallel.DistributedDataParallel() wrapper have the following advantages over
other approaches to data parallelism.
1. Each process maintains its own optimizer and performs a complete optimization step
in each iteration. While this may appear redundant, because the gradients have already
been gathered together and averaged across processes and are thus the same for every
process, it means that no parameter broadcast step is needed, reducing the time spent
transferring tensors between nodes and therefore decreasing the execution time of each
deep learning iteration.
2. Each process contains an independent Python interpreter, eliminating the extra
interpreter overhead and "GIL-thrashing" that comes from driving several execution
threads, model replicas, or GPUs from a single Python process. This is especially
important for models that make heavy use of the Python runtime, including models with
recurrent layers or many small components. As our program has recurrent LSTM layers
with many hidden layers, this gives an advantage during the deep model iterations.


CHAPTER VI
IMPLEMENTATION
6.1 Best Learning Rate

Deep Layers | Learning Rate | Hidden Layers | Execution Time | Prediction Accuracy | F1 Score
3 | 0.01 | 32 | 0:53:25 | 0.18052256 | 0.0552101
3 | 0.01 | 64 | 1:20:59 | 0.18052256 | 0.0552101
3 | 0.01 | 128 | 1:48:32 | 0.18052256 | 0.0552101
3 | 0.001 | 32 | 0:53:41 | 0.91177469 | 0.9116894
3 | 0.001 | 64 | 1:20:42 | 0.8995589 | 0.8998715
3 | 0.001 | 128 | 2:44:04 | 0.91923988 | 0.9191246
3 | 0.0001 | 32 | 0:53:39 | 0.89582628 | 0.8954641
3 | 0.0001 | 64 | 1:20:57 | 0.87580591 | 0.8761988
3 | 0.0001 | 128 | 3:58:56 | 0.89752293 | 0.8974544

Table. 6 Best Learning Rate


[Column chart of prediction accuracy and F1 score for learning rates 0.01, 0.001 and 0.0001 at 32, 64 and 128 hidden layers; the underlying values are listed in Table. 6.]

Figure. 41 Best Learning Rate
In Figure. 41 we have shown the prediction accuracy along with the F1 score for the 3
different learning rates 0.01, 0.001 and 0.0001. The F1 score is calculated from the
confusion matrix and is an important check on the reported accuracy. The learning rate
0.01 yields an accuracy of 0.18 in Table. 6, which is too poor to be accepted for this
research. The learning rate 0.0001 yields accuracies between 85% and 90%, which is
acceptable, but its execution time is very high, almost double that of the 0.01 learning
rate, so it is also discarded. The learning rate 0.001 yields an accuracy of 0.9192 in
Table. 6, which is the best accuracy of the 3 learning rates along with the best execution
time for the deep learning iterations.
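A minimal sketch of how such an F1 score can be derived with scikit-learn; the label arrays below are illustrative placeholders for the encoded test labels and model predictions, and the weighted average is an assumption:

from sklearn import metrics

# Illustrative placeholder labels and predictions for the six activity classes.
y_true = [0, 1, 2, 3, 4, 5, 1, 2]
y_pred = [0, 1, 2, 3, 5, 5, 1, 0]

conf_matrix = metrics.confusion_matrix(y_true, y_pred)
f1 = metrics.f1_score(y_true, y_pred, average="weighted")
print(conf_matrix)
print("Weighted F1 score: {:.4f}".format(f1))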


6.2 CPU Execution Time between Layers

Figure. 42 Big Machine CPU Details
Deep Layers | Residual Layers | Execution Time (32 hidden) | Execution Time (64 hidden) | Execution Time (128 hidden) | Execution Time (256 hidden)
2 x | 2 Layers | 1:42:21 | 2:09:22 | 3:04:51 | 5:49:40
3 x | 3 Layers | 5:18:24 | 9:34:22 | 6:44:26 | 16:00:54
4 x | 4 Layers | 8:35:14 | 9:14:42 | 11:18:50 | 20:50:11

Table. 7 Execution rate between Layers in CPU

In Table. 7, the execution times of all 3 configurations, namely 2 deep layers with 2
residual layers, 3 deep layers with 3 residual layers and 4 deep layers with 4 residual
layers, are shown as computed by the CPU. The CPU hardware and its internal
configuration are shown in Figure 42. The CPU uses its 16 cores for the computational
analysis, as shown in Figure. 50. Each configuration is run with 4 sizes of hidden layers,
32, 64, 128 and 256, whose tensors drive the deep learning iterations. As shown in the
bubble chart of Figure. 43, the execution time increases as the number of hidden layers
increases within the same configuration; a smaller bubble means less execution time and
a larger bubble reflects a longer execution time. As the layer scale increases, the
execution time increases as well. As can be seen from Figure. 44, when the number of
layers increases along with more hidden layers, the execution time becomes longer still.

[Bubble chart of CPU execution times; larger bubbles indicate longer execution times.]

Figure. 43 Bubble Chart of CPU Execution

[Column graph of CPU execution time for the 2 x 2, 3 x 3 and 4 x 4 layers at 32, 64, 128 and 256 hidden layers.]

Figure. 44 Column Graph of CPU Execution between Layers

6.3 GPU Execution Time between Layers
Deep Layers | Residual Layers | Execution Time (32 hidden) | Execution Time (64 hidden) | Execution Time (128 hidden) | Execution Time (256 hidden)
2 x | 2 Layers | 2:37:24 | 2:37:39 | 2:26:23 | 2:40:05
3 x | 3 Layers | 6:06:51 | 6:16:36 | 6:04:10 | 6:09:33
4 x | 4 Layers | 12:11:21 | 11:48:27 | 12:19:01 | 12:30:06

Table. 8 Execution rate between Layers in GPU

In Table. 8, the execution times of the 2 x 2, 3 x 3 and 4 x 4 layers with 32, 64, 128 and
256 hidden layers are shown, as computed by the GPU cluster.
The graphical representation of the execution time of each configuration and hidden
layer count is shown as a bubble chart in Figure. 45 and as a column bar graph in
Figure. 46. Here we found an interesting result: for a given number of deep layers, the
GPU execution time stays at roughly the same level regardless of the number of hidden
layers. As shown in Figure. 46, the execution time of the 4 x 4 network is approximately
the same for all 4 hidden layer sizes, which can also be seen in Figure. 45.

[Bubble chart of GPU execution times; larger bubbles indicate longer execution times.]

Figure. 45 Bubble Chart of GPU Execution

[Column graph of GPU execution time for the 2 x 2, 3 x 3 and 4 x 4 layers at 32, 64, 128 and 256 hidden layers.]

Figure. 46 Column Graph of GPU Execution between Layers


Figure. 47 2 x 2 Layers GPU Utilization Snapshot
From Figures 47, 48 and 49 we found that less than one third of the computational power
of a single GPU from the cluster is harnessed. With the 2 x 2 layers the GPU utilization is
18%, with the 3 x 3 layers it increases slightly to 22%, and with the 4 x 4 layers it
increases to 31%. The GPU architecture works on the principle of SIMD vectorization.
SIMD processing exploits data-level parallelism. Data-level parallelism means that the
operations required to transform a set of vector elements can be performed on all
elements of the vector at the same time. That is, a single instruction can be applied to
multiple data elements in parallel.

Support for SIMD operations is pervasive in the Cell Broadband Engine. In the PPE they
are supported by the Vector/SIMD Multimedia Extension instruction set; in the SPEs
they are supported by the SPU instruction set.
In both the PPE and SPEs, vector registers hold multiple data elements as a single vector.
The data paths and registers supporting SIMD operations are 128 bits wide,
corresponding to four full 32-bit words. This means that four 32-bit words can be loaded
into a single register and, for example, added to four other words in a different register in
a single operation.
The process of preparing a program for use on a vector processor is called vectorization
or SIMDization. It can be done manually by the programmer, or it can be done by a
compiler that performs auto-vectorization. Here the GPU performs the auto-vectorization
process, supported by the 16 CPU cores, so that only 25-30% of the computational power
of GPU 0 is utilized while most of the CPU cores are utilized at 100% of their power,
which can be seen in Figure. 50.
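As a small illustration of the data-level parallelism described above (a sketch, not part of the benchmark code), a single vectorized NumPy operation applies one logical instruction to several elements at once, in contrast to an explicit element-by-element loop:

import numpy as np

a = np.arange(4, dtype=np.int32)         # four 32-bit words, i.e. one 128-bit register worth of data
b = np.full(4, 10, dtype=np.int32)

# Scalar style: one addition per element.
c_scalar = np.empty(4, dtype=np.int32)
for i in range(4):
    c_scalar[i] = a[i] + b[i]

# Vectorized (SIMD-like) style: one element-wise operation over all four words.
c_vector = a + b
assert (c_scalar == c_vector).all()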


Figure. 48 3 x 3 Layers GPU Utilization Snapshot


Figure. 49 4 x 4 Layers GPU Utilization Snapshot


Figure. 50 3 x 3 Layers CPU Utilization Snapshot


6.4 Bidirectional Vs Non-Bidirectional Execution between Layers
Deep Layers | Residual Layers | Bidirectional | Hidden Layers | Prediction Accuracy | Execution Time
3 | 3 | TRUE | 32 | 0.91007805 | 5:18:24
3 | 3 | TRUE | 64 | 0.91143537 | 9:34:22
3 | 3 | TRUE | 128 | 0.91313201 | 6:44:26
3 | 3 | TRUE | 256 | 0.18221921 | 16:00:55
3 | 3 | FALSE | 32 | 0.93213439 | 2:47:35
3 | 3 | FALSE | 64 | 0.88394976 | 3:52:52
3 | 3 | FALSE | 128 | 0.92093652 | 6:03:38
3 | 3 | FALSE | 256 | 0.18221921 | 21:51:49

Table. 9 Bidirectional vs Non-bidirectional layers execution time


[Column graph of prediction accuracy and execution time for the 3 x 3 bidirectional (TRUE) and non-bidirectional (FALSE) layers at 32, 64, 128 and 256 hidden layers.]

Figure. 51 Column Graph of Execution time between bidirectional and non-bidirectional
In Table. 9, the execution time along with the prediction accuracy is shown for 3 x 3 deep
residual layers, where one run learns with bidirectional layers and the other with
non-bidirectional (unidirectional) layers. When we observe the prediction accuracy in
Figure. 51, with 32 hidden layers the unidirectional layers give better prediction
accuracy, but as the hidden layers increase to 64 and 128 the bidirectional layers give
better prediction accuracy, at the cost of a higher execution time, since there are almost
twice as many LSTM layers as in the unidirectional LSTM. As the deep layers with
hidden layers increase past a certain threshold of tensors, both variants fail to give the
expected prediction accuracy. So from this result we found that for each experiment
there is an efficiency threshold above which the architectural choice no longer matters.


6.5 Stack Bidirectional Vs Stack Non-Bidirectional Execution between
Layers
Deep Layers | Residual Layers | Bidirectional | Hidden Layers | Prediction Accuracy | Execution Time
3 | 0 | TRUE | 32 | 0.88768238 | 1:30:33
3 | 0 | TRUE | 64 | 0.91754329 | 2:12:40
3 | 0 | TRUE | 128 | 0.90295213 | 3:18:13
3 | 0 | TRUE | 256 | 0.892772317 | 9:28:29
3 | 0 | FALSE | 32 | 0.91177469 | 0:53:41
3 | 0 | FALSE | 64 | 0.8995589 | 1:20:42
3 | 0 | FALSE | 128 | 0.91923988 | 2:44:05
3 | 0 | FALSE | 256 | 0.182219207 | 7:59:22

Table. 10 Stack Bidirectional vs Stack Non-bidirectional execution time

[Column graph of prediction accuracy and execution time for the 3 x 0 bidirectional (TRUE) and non-bidirectional (FALSE) stacked layers at 32, 64, 128 and 256 hidden layers.]

Figure. 52 Deep Bidirectional vs Deep Non-bidirectional Execution time

In Table. 10, the experiment is carried out on stacked (deep) layers with bidirectional and
with non-bidirectional communication. Figure. 52 shows that the prediction accuracy is
better with unidirectional communication between layers when there are fewer hidden
layers, and that the bidirectional advantage is inconsistent as the number of hidden layers
grows. With 256 hidden layers, however, the bidirectional communication gives a much
better result than the single-directional communication, which is a very good prediction
accuracy.


6.6 Best Accuracy between all Layers
Deep Layers | Residual Layers | Bidirectional | Hidden Layers | Prediction Accuracy | Execution Time
3 | 0 | TRUE | 32 | 0.88768238 | 1:30:33
3 | 0 | TRUE | 64 | 0.91754329 | 2:12:40
3 | 0 | TRUE | 128 | 0.90295213 | 3:18:13
3 | 0 | TRUE | 256 | 0.892772317 | 9:28:29
3 | 0 | FALSE | 32 | 0.91177469 | 0:53:41
3 | 0 | FALSE | 64 | 0.8995589 | 1:20:42
3 | 0 | FALSE | 128 | 0.91923988 | 2:44:05
3 | 0 | FALSE | 256 | 0.182219207 | 7:59:22
3 | 3 | FALSE | 32 | 0.93213439 | 2:47:35
3 | 3 | FALSE | 64 | 0.88394976 | 3:52:52
3 | 3 | FALSE | 128 | 0.92093652 | 6:03:38
3 | 3 | FALSE | 256 | 0.182219207 | 21:51:49
3 | 3 | TRUE | 32 | 0.91007805 | 5:18:24
3 | 3 | TRUE | 64 | 0.91143537 | 9:34:22
3 | 3 | TRUE | 128 | 0.91313201 | 6:44:26
3 | 3 | TRUE | 256 | 0.182219207 | 16:00:55

Table. 11 Best Accuracy in 3 deep layers

[Column graph of prediction accuracy and execution time for the 3 x 0 TRUE, 3 x 0 FALSE, 3 x 3 FALSE and 3 x 3 TRUE configurations at 32, 64, 128 and 256 hidden layers.]

Figure. 53 Best Accuracy among all types of 3 stacked layers
In Figure. 53, the prediction accuracy along with the execution time of each 3-deep-layer
variant, with or without residual layers and with or without bidirectional layers, is
compared to give an overview of the better configuration for this experiment. With 32
hidden layers, the 3 x 3 configuration with bidirectional false gives the best prediction
accuracy with a good execution time. With 64 hidden layers, the 3 x 3 bidirectional
configuration gives better prediction accuracy than any of the others, as shown in the
previous section. With 128 hidden layers, the 3 x 3 non-bidirectional configuration gives
the best accuracy but with a higher execution time, while 3 x 0 gives almost the same
prediction accuracy in half the execution time. With 256 hidden layers, the 3 x 0
bidirectional layer is the clear winner compared to the others in terms of both prediction
accuracy and execution time.


6.7 Deep Residual Bidirectional Layers: 3 x 3 vs 4 x 4
Deep Layers | Residual Layers | Bidirectional | Hidden Layers | Prediction Accuracy | Execution Time
3 | 3 | TRUE | 32 | 0.91007805 | 5:18:24
3 | 3 | TRUE | 64 | 0.91143537 | 9:34:22
3 | 3 | TRUE | 128 | 0.91313201 | 6:44:26
3 | 3 | TRUE | 256 | 0.182219207 | 16:00:55
4 | 4 | TRUE | 32 | 0.91448933 | 8:35:14
4 | 4 | TRUE | 64 | 0.87580591 | 9:14:42
4 | 4 | TRUE | 128 | 0.90430945 | 11:18:50
4 | 4 | TRUE | 256 | 0.182219207 | 20:50:11

Table. 12 Deep Residual 3 x3 vs 4 x 4 layers

[Column graph of prediction accuracy and execution time for the 3 x 3 TRUE and 4 x 4 TRUE deep residual bidirectional layers at 32, 64, 128 and 256 hidden layers.]

Figure. 54 Column graph of 3 x 3 vs 4 x 4 Deep Residual Layers

In Figure. 54, we try to show how the deep residual bidirectional layers behave in a more
complex network, since this is the configuration used throughout the research. We found
that with our test data and more layers of bidirectional communication the execution
time increases rapidly, while the prediction accuracy dropped and gave the same test
result as the 3 x 3 layers. We might need larger datasets and a more complex architecture
to test this feature further.

6.8 Deep Layers vs Prediction Accuracy vs Execution Time in CPU
Deep Layers | Hidden Layers | Prediction Accuracy | Execution Time
2 Layers | 32 | 0.922293842 | 1:42:21
2 Layers | 64 | 0.924669147 | 2:09:22
2 Layers | 128 | 0.877502561 | 3:04:51
2 Layers | 256 | 0.182219207 | 5:49:40
3 Layers | 32 | 0.910078049 | 5:18:24
3 Layers | 64 | 0.911435366 | 9:34:22
3 Layers | 128 | 0.913132012 | 6:44:26
3 Layers | 256 | 0.182219207 | 16:00:54
4 Layers | 32 | 0.914489329 | 8:35:14
4 Layers | 64 | 0.875805914 | 9:14:42
4 Layers | 128 | 0.904309452 | 11:18:50
4 Layers | 256 | 0.182219207 | 20:50:11

Table. 13 Execution matrix of all layers by CPU

[Column graph of prediction accuracy and execution time on the CPU for the 2, 3 and 4 layer models at 32, 64, 128 and 256 hidden layers.]

Figure. 55 CPU Execution Graph for all Layers

In Figure. 55, we have shown the prediction accuracy along with the execution time of
the different configurations, the 2 x 2, 3 x 3 and 4 x 4 layers. With 32 hidden layers, the
2 x 2 architecture performed better, with higher prediction accuracy than the other
configurations. With 64 hidden layers the 2 x 2 configuration improved further, with a
good prediction accuracy of 0.9246. With 128 hidden layers the 3 x 3 architecture started
performing better than the others, with 0.9131 prediction accuracy. With 256 hidden
layers the 4 x 4 architecture gave relatively better prediction accuracy than the other
configurations, 0.2143, but it is still a low prediction accuracy. The execution time of the
4 x 4 configuration is always higher, as it has more layers to iterate over in the deep
learning model. This experiment concluded that our research data performs well with the
3 x 3 architecture; we might need a bigger dataset to check the feasibility of the 4 x 4
configuration.

6.9 Deep Layers vs Prediction Accuracy vs Execution Time in GPU
Deep Layers | Hidden Layers | Prediction Accuracy | Execution Time
2 Layers | 32 | 0.901255488 | 2:37:24
2 Layers | 64 | 0.932813048 | 2:37:39
2 Layers | 128 | 0.904648781 | 2:26:23
2 Layers | 256 | 0.182219207 | 2:40:05
3 Layers | 32 | 0.893450975 | 6:06:51
3 Layers | 64 | 0.925347805 | 6:16:36
3 Layers | 128 | 0.883271098 | 6:04:10
3 Layers | 256 | 0.182219207 | 6:09:34
4 Layers | 32 | 0.907702744 | 12:11:22
4 Layers | 64 | 0.920257866 | 11:48:27
4 Layers | 128 | 0.182219207 | 12:19:01
4 Layers | 256 | 0.182219207 | 12:30:06

Table. 14 Execution matrix of all layers by GPU

[Column graph of prediction accuracy and execution time on the GPU for the 2, 3 and 4 layer models at 32, 64, 128 and 256 hidden layers.]

Figure. 56 GPU Execution Graph for all Layers
Figure. 56 shows the prediction accuracy and execution time of the different
configurations, 2 x 2, 3 x 3 and 4 x 4, with 32, 64, 128 and 256 hidden layers, using the
computational power of the GPU cluster. With the GPU, the 4 x 4 configuration performs
with a higher prediction accuracy of 0.9077 than the other configurations in the 32
hidden layer network. With 64 hidden layers the 2 x 2 configuration gives a better
prediction accuracy, about 0.93, than the others. With 128 hidden layers the 2 x 2
configuration again does much better, with a good prediction accuracy of 0.9046 and an
outstanding execution time, which is overall the same as explained in the previous
section. With 256 hidden layers all configurations failed to give the expected prediction
accuracy, but the execution time improved drastically over the CPU: it roughly dropped
to half with the GPU cluster as compared to the CPU.


6.10 Deep Layer CPU Execution
Deep Layers | Residual Layers | Hardware | Hidden Layers | Prediction Accuracy | Execution Time
4 x | 4 | CPU | 8 | 0.918221891 | 7:58:20
4 x | 4 | CPU | 16 | 0.90498811 | 7:58:46
4 x | 4 | CPU | 32 | 0.914489329 | 8:35:14
4 x | 4 | CPU | 64 | 0.875805914 | 9:14:42
4 x | 4 | CPU | 128 | 0.904309452 | 11:18:50
4 x | 4 | CPU | 256 | 0.182219207 | 20:50:11

Table. 15 4 x 4 deep Layers CPU execution matrix

[Column graph of prediction accuracy and execution time for the 4 x 4 layers on the CPU at 8, 16, 32, 64, 128 and 256 hidden layers.]

Figure. 57 Column Graph of 4 x 4 deep Layers CPU execution
In Figure. 57, we performed the experiment on the 4 x 4 deep layers from a very low
number of hidden layers up to a high number of hidden layers to check the prediction
accuracy. With 8 hidden layers the accuracy is better compared to the other hidden layer
networks. As the number of hidden layers increases, the prediction accuracy starts to
drop. In this experiment we also found that with 16 hidden layers and with 128 hidden
layers the prediction accuracy is almost equivalent, but the execution time for 128 hidden
layers is considerably longer.

6.11 Lower GPU vs Higher CPU
Deep Layers | Residual Layers | Hardware | Hidden Layers | Prediction Accuracy | Execution Time
3 x | 3 | GPU | 32 | 0.893450975 | 6:06:51
3 x | 3 | GPU | 64 | 0.925347805 | 6:16:36
3 x | 3 | GPU | 128 | 0.8832711 | 6:04:10
3 x | 3 | GPU | 256 | 0.182219207 | 6:09:33
4 x | 4 | CPU | 32 | 0.914489329 | 8:35:14
4 x | 4 | CPU | 64 | 0.875805914 | 9:14:42
4 x | 4 | CPU | 128 | 0.90430945 | 11:18:50
4 x | 4 | CPU | 256 | 0.182219207 | 12:30:06

Table. 16 3 x 3 Layer GPU vs 4 x4 Layer CPU execution matrix
Table. 16 shows the execution data of the 3 x 3 deep layers on the GPU and the 4 x 4
layers on the CPU, to evaluate the benefit of the GPU over the CPU for a more complex
network with a bigger dataset. We found that the prediction accuracy of the 4 x 4 CPU
run is higher with 32 hidden layers; with 64 hidden layers the 3 x 3 GPU run gives better
prediction accuracy than the 4 x 4 CPU run; and with 128 hidden layers the CPU
performs better, with a higher prediction accuracy of 0.9043 compared to the GPU, but at
about twice the GPU execution time, which is not good. In Figure. 58, with 256 hidden
layers the execution time of the 4 x 4 configuration roughly doubles, and both the CPU
and GPU models failed to give the expected prediction accuracy.


[Column graph of prediction accuracy and execution time for the 3 x 3 GPU and 4 x 4 CPU runs at 32, 64, 128 and 256 hidden layers.]

Figure. 58 Column Graph of 3 x 3 GPU vs 4 x 4 CPU Execution Result

6.12 4 x 4 CPU vs 4 x 4 GPU Layers
Deep Layers | Residual Layers | Hardware | Hidden Layers | Prediction Accuracy | Execution Time
4 x | 4 | CPU | 32 | 0.914489329 | 8:35:14
4 x | 4 | CPU | 64 | 0.875805914 | 9:14:42
4 x | 4 | CPU | 128 | 0.90430945 | 11:18:50
4 x | 4 | CPU | 256 | 0.182219207 | 20:50:11
4 x | 4 | GPU | 32 | 0.907702744 | 12:11:22
4 x | 4 | GPU | 64 | 0.920257866 | 11:48:27
4 x | 4 | GPU | 128 | 0.1822192 | 12:19:01
4 x | 4 | GPU | 256 | 0.182219207 | 12:30:06

Table.17 4 x 4 Layers CPU vs GPU Execution

[Column graph of prediction accuracy and execution time for the 4 x 4 layers on CPU and GPU at 32, 64, 128 and 256 hidden layers.]

Figure. 59 Graph of 4 x 4 deep layers CPU vs GPU Execution
In Figure. 59, we have shown the computational power of both the CPU and the GPU for
a complex deep learning model with the 4 x 4 architecture. With 32 hidden nodes the
CPU performs better, with a higher prediction accuracy of 0.9144. With 64 hidden nodes
the GPU gives the better result, about 0.92, with roughly the same execution time as for
the previous hidden layer size. With 128 hidden layers the CPU accuracy is 0.9043,
which is outstanding compared to the GPU, which is just 0.18. With 256 hidden layers,
both the GPU and the CPU in the 4 x 4 configuration failed to reach the expected results,
as both gave the same prediction accuracy of 0.18, but the execution time of the CPU is
almost 100% more than the GPU execution time. So in this experiment the GPU with 64
hidden layers is the selected result, at about 92% accuracy.


6.13 Bidirectional Lower vs Stack Higher Layers
Deep Layers | Residual Layers | Bidirectional | Hidden Layers | Prediction Accuracy | Execution Time
2 x | 2 | TRUE | 32 | 0.922293842 | 1:42:21
2 x | 2 | TRUE | 64 | 0.924669147 | 2:09:22
2 x | 2 | TRUE | 128 | 0.877502561 | 3:04:51
2 x | 2 | TRUE | 256 | 0.182219207 | 5:49:40
3 x | 3 | FALSE | 32 | 0.93213439 | 2:47:35
3 x | 3 | FALSE | 64 | 0.88394975 | 3:52:52
3 x | 3 | FALSE | 128 | 0.92093652 | 6:03:38
3 x | 3 | FALSE | 256 | 0.182219207 | 21:51:49

Table. 18 2 x 2 Bidirectional Stack Layer vs 3 x 3 Stack Layer
In Figure. 60, we compare the 2 x 2 bidirectional layers with the 3 x 3 non-bidirectional
layers, which have almost the same computational power: 2 x 2 with bidirectional cells
gives 8 times the complexity of a single LSTM cell, while 3 x 3 non-bidirectional gives
9 times the complexity of a single LSTM cell. With 32 hidden layers, the 3 x 3
configuration gives better accuracy than the 2 x 2 configuration, whereas with 64 hidden
layers the 2 x 2 configuration gives the better result. When we consider 128 hidden
layers over the 2 x 2 and 3 x 3 stacked layers, we find that the initial lower layers are
used to learn the model while the higher layers refine the accuracy of the model; in this
case 3 x 3 has more initial layers, which gives better learning to the model than the 2 x 2
layers. With 256 hidden layers both models failed to perform, and the execution time of
the 3 x 3 layers is very high, almost 4 times that of the 2 x 2 model. So we draw the
conclusion that with more layers the model will learn faster, but as the complexity of the
hidden layers increases it fails to pass the learning up to the higher layers, which takes
more time because the information flow becomes very slow.

[Column graph of prediction accuracy and execution time for the 2 x 2 TRUE and 3 x 3 FALSE layers at 32, 64, 128 and 256 hidden layers.]

Figure. 60 Graph of 2 x 2 Bidirectional Stack Layer vs 3 x 3 Stack Layer


6.14 Stack vs Hidden layer on Execution Time and Prediction Accuracy

Layers | Hidden Layers | Execution Time | Prediction Accuracy
2 x 2 | 32 | 1:42:21 | 0.922293842
2 x 2 | 64 | 2:09:22 | 0.924669147
2 x 2 | 128 | 3:04:51 | 0.877502561
2 x 2 | 256 | 5:49:40 | 0.182219207
4 x 4 | 8 | 7:58:20 | 0.918221891
4 x 4 | 16 | 7:58:46 | 0.90498811
4 x 4 | 32 | 8:35:14 | 0.914489329
4 x 4 | 64 | 9:14:42 | 0.875805914

Table. 19 2 x 2 stacked hidden layers vs 4 x 4 stacked hidden layers

[Chart of execution time and prediction accuracy for the 2 x 2 layers (32, 64, 128, 256 hidden) and the 4 x 4 layers (8, 16, 32, 64 hidden).]

Figure. 61 Execution time Graph of 2 x2 vs 4x 4 stacked layers

In Table 19, we have shown the matrix of 2 x 2 stacked layers with larger hidden layers,
compared with 4 x 4 stacked layers with smaller hidden layers, so that the two have
almost the same computational power per single layer. We used the 2 x 2 layers with 32,
64, 128 and 256 hidden layers and the 4 x 4 layers with 8, 16, 32 and 64 hidden layers.
As shown in Figure 61, the biggest issue with more stacked layers is the execution time:
it is very high for the deeper configuration even with fewer hidden layers. With the 4 x 4
layers the execution time is almost double that of the 2 x 2 layers. In terms of time
efficiency, the 2 x 2 layers are the clear winner in this part of the research, as shown in
Figure. 62. For prediction accuracy, the 2 x 2 stacked layers with 32 and 64 hidden layers
gave better prediction accuracy than the 4 x 4 stacked layers with 8 and 16 hidden layers.
So we concluded that shallower models with more hidden nodes provide better test
results and execution time than deeper models with fewer hidden nodes.


[Chart of prediction accuracy and execution time for the 2 x 2 and 4 x 4 stacked layers against their hidden layer counts.]

Figure. 62 Execution time graph with stack layers vs hidden layers


6.15 PyTorch vs TensorFlow Efficiency Comparison
Deep Layers | Language | Hidden Layers | Prediction Accuracy | Execution Time
3 Layers | PyTorch | 32 | 0.911487694 | 0:40:53
3 Layers | PyTorch | 64 | 0.902257844 | 0:51:39
3 Layers | PyTorch | 128 | 0.918266254 | 2:21:13
3 Layers | PyTorch | 256 | 0.182218305 | 6:58:31
3 Layers | TensorFlow | 32 | 0.911774695 | 0:53:41
3 Layers | TensorFlow | 64 | 0.899558902 | 1:20:42
3 Layers | TensorFlow | 128 | 0.919239879 | 2:44:05
3 Layers | TensorFlow | 256 | 0.182219207 | 7:59:22

Table.20 Efficiency between PyTorch and TensorFlow


[Column graph of prediction accuracy and execution time for the 3-layer PyTorch and TensorFlow implementations at 32, 64, 128 and 256 hidden layers.]

Figure. 63 Execution Graph between PyTorch and TensorFlow
Table. 20 shows the experiment results of the 3-deep-layer network with 32, 64, 128 and
256 hidden layers programmed using the TensorFlow API and the PyTorch API.
TensorFlow and PyTorch are both very good frameworks used by machine learning
researchers for building deep neural networks. The major difference is that the
TensorFlow core APIs are built using C++ with Python used as a wrapper around the
core to communicate with the data, whereas PyTorch is built on top of the Torch
framework with a Python wrapper. The best way to compare the two frameworks is to
code the same model in both of them.
Figure. 63 displays the graphical representation of the Table. 20 data. We found that
PyTorch executes much faster than TensorFlow; its execution time is lower than
TensorFlow's for all hidden layer sizes, while the prediction accuracy is similar. Over the
whole experiment, in the 128 hidden layer network PyTorch gives marginally better
prediction accuracy and less execution time than the TensorFlow framework, so PyTorch
is the winner.
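As a small concrete illustration of coding the same building block in both frameworks, the sketch below declares one LSTM layer in each API, assuming a hidden size of 32 and the nine inertial signal channels of the dataset:

import tensorflow as tf
import torch

n_hidden = 32    # assumed hidden size

# TensorFlow 1.x: an LSTM cell, later unrolled with tf.nn.static_rnn or tf.nn.dynamic_rnn.
tf_cell = tf.nn.rnn_cell.LSTMCell(n_hidden, state_is_tuple=True)

# PyTorch: an LSTM module applied directly to a (seq_len, batch, features) tensor.
torch_lstm = torch.nn.LSTM(input_size=9, hidden_size=n_hidden, num_layers=1)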

6.16 Raspberry PI Cluster vs Intel Xeon CPU Efficiency Comparison

Deep Layers | Hardware | Hidden Layers | Prediction Accuracy | Execution Time
3 Layers | PI Cluster | 32 | 0.921709823 | 1:43:51
3 Layers | PI Cluster | 64 | 0.904981205 | 2:45:39
3 Layers | PI Cluster | 128 | 0.922189017 | 4:14:43
3 Layers | PI Cluster | 256 | 0.182218305 | 10:19:31
3 Layers | Intel Xeon CPU | 32 | 0.911774695 | 0:53:41
3 Layers | Intel Xeon CPU | 64 | 0.899558902 | 1:20:42
3 Layers | Intel Xeon CPU | 128 | 0.919239879 | 2:44:05
3 Layers | Intel Xeon CPU | 256 | 0.182219207 | 7:59:22

Table.21 Efficiency between Raspberry Pi Cluster and Intel Xeon CPU

In this section we performed another experiment in which we executed our LSTM deep
learning model with the same UCI dataset but on different hardware. In the previous
section we used the same dataset on the same hardware but with different frameworks.
One hardware platform is a 16-thread multicore Intel Xeon CPU with 32 GB of memory,
and the other is a cluster of 16 Raspberry Pi nodes, each with 1 GB of RAM, working
together in a cluster built on the parameter server architecture.
In Table.21 we recorded all of our experiment results, and Figure 64 is the graphical
representation of those results. We observed that in distributed machine learning the
accuracy of the model improved at the cost of more execution time. With 32 hidden
layers, the Pi cluster gives 92% accuracy in 1 hour 43 minutes of execution time, whereas
the Intel CPU gives 91% accuracy with 53 minutes of execution time. The cluster gives
better prediction accuracy with a higher execution time than the multicore Intel CPU;
this might be due to the communication throughput of the 16 Pis connected together
during execution. With 256 hidden layers the accuracy is equivalent on both platforms, at
just 18%, but the execution time of the cluster is higher than that of the single Intel Xeon
CPU. So we can draw the conclusion that with high-power GPU-clustered distributed
machines this could become an efficient performance improvement, which needs further
research.
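A minimal sketch, under the assumption of local test addresses, of how such a parameter-server cluster is declared with the distributed TensorFlow API; in the real testbed the localhost entries are replaced by the 16 Raspberry Pi addresses:

import tensorflow as tf

# One parameter server and two workers on localhost ports, for illustration only.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})

# Each node of the cluster starts a server with its own job name and task index.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Variables are placed on the parameter server; operations are replicated on the workers.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    weights = tf.get_variable("w", shape=[9, 6])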

[Column graph of prediction accuracy and execution time for the 3-layer Pi cluster and Intel Xeon CPU runs at 32, 64, 128 and 256 hidden layers.]

Figure.64 Execution graph between Pi Cluster vs Intel Xeon CPU

CHAPTER VII
CONCLUSION
7.1 Summary
In this thesis, we proposed a distributed deep learning model to solve a Human Activity
Recognition (HAR) problem. We focused on the deep learning model using the
asynchronous parameter server architecture as well as the synchronous all-reduce
approach. For this purpose, we created a Raspberry Pi cluster using 16 Raspberry Pi
nodes and an NVIDIA GPU cluster with 3 NVIDIA GPUs, where both systems are tested
with the distributed approach by using the distributed TensorFlow API and the PyTorch
API. To work on the HAR problem, we created a Residual Bidirectional LSTM to
simulate HAR on this distributed system. There are several points that we tested in this
research thesis. First we created a multilayer deep learning model with 2 x 2, 3 x 3 and
4 x 4 architectures, where stacked layers are compared with non-stacked layers, stacked
layers are compared with residual bidirectional layers, and residual bidirectional layers
are compared with residual non-bidirectional layers. These layer variations were
executed over CPUs and GPUs to benchmark the performance. Small stacked layers with
many hidden layers were also compared with highly stacked layers with fewer hidden
layers on both CPU and GPU.


In this experiment we evaluated the execution times along with the prediction accuracy
over the multicore CPU and the GPU cluster using different programming APIs,
TensorFlow and PyTorch. Note that the execution time of PyTorch over the 3 x 3 layers
is faster than that of TensorFlow over the same layers. We have also tested the distributed
TensorFlow framework on the Raspberry Pi cluster to benchmark the CPU performance
of the 16-node Raspberry Pi cluster, where each Pi has 1 GB of RAM, 16 GB altogether,
against the 32 GB octa-core Intel Xeon CPU. When the result of the CPU computation of
the 2 x 2 layers over the multicore CPU is compared with the Raspberry Pi cluster, we
found that in the distributed network the execution time is almost twice that of the
multicore system, due to high latency and low throughput.
After comparing all experiments, the results show that the implementation of distributed
TensorFlow on the GPU cluster works much faster than on the multicore CPU for a high
number of stacked layers with many hidden layers, but it takes approximately the same
time for fewer stacked layers and dense stacked layers. For fewer stacked layers, GPU
computation is less efficient. The CPU computation gives better prediction accuracy with
3 x 3 stacked layers, but its execution time is 3 times slower than the GPU cluster.

7.2 Future Work
There will be continuous efforts on developing the PyTorch and TensorFlow
programming frameworks with different computing structures. How PyTorch behaves
with GPU clusters would be another interesting study. There are multiple distributed
architectures that need to be tested. This would not only require changes in the
programming structure, but would also need more sophisticated multi-GPU cluster
hardware. For achieving efficiency in terms of optimal data movement, this attempt
would require multiple GPU units physically connected to each other, like the Raspberry
Pi cluster, or connected over the internet, where bandwidth would be another parameter
to research. Due to the limitation of GPUs across machines, a multi-machine GPU cluster
could be future work. A Network Attached Storage (NAS) server along with the
Raspberry Pi cluster would become a more effective solution for storage problems.
Distributed deep learning is just the beginning of a new dimension of research with
massive-scale datasets from different geographical areas.


APPENDIX A – TESTBED ARCHITECTURE
A.1 NVIDIA GPU MACHINE SETUP
In this single-machine cluster, 3 NVIDIA GPU cards are used. These GPUs were taken
from different older machines and put together in this machine for clustering purposes.
1. NVIDIA Tesla K40c
2. NVIDIA Quadro P5000
3. NVIDIA Quadro K620
Different GPUs might be installed with different drivers on different machines. In a
single-machine cluster they should all run under one driver that supports all of the
GPUs.
The following command checks whether any driver is already installed on your machine.
$ ubuntu-drivers devices
The following command lists all NVIDIA devices in your machine.
$ lspci -v | grep NVIDIA
Step-1 : Remove previous installations
This command removes any older driver that is already installed.
$ sudo apt-get purge nvidia*
This command removes the CUDA installation along with the drivers as well.
$ sudo apt-get autoremove
This command checks which NVIDIA GPU cards the machine has, as shown in Figure.65.
$ sudo lshw -c display

Figure.65 NVIDIA GPU Cards

You can see in Figure.65 that the devices show the default "driver=nouveau", which
means the NVIDIA driver is not installed on this machine.

There are 2 ways to install the NVIDIA driver on a machine.
The first is to install from the PPA drivers, which are third-party packages compatible
with all NVIDIA GPUs. The second is to install from the NVIDIA website after manually
checking each GPU model for driver compatibility.
The advantage of the PPA is that it is easy and it keeps updating automatically when the
maintainer adds new versions. Ubuntu integrates video into the kernel with dpkg; if you
install directly from NVIDIA, you still have to manually rerun that part of the install task
with each kernel update, otherwise video stops working. With the PPA this is automatic,
which is why you don't see it in synaptic or in dpkg commands.

Step-2: Download the Driver (With NVIDIA Driver)
https://www.nvidia.com/content/DriverDownload-March2009/confirmation.php?url=/tesla/410.104/NVIDIA-Linux-x86_64-410.104.run&lang=us&type=Tesla
In the website, you need to choose the required driver for the installed GPUs in your
machine by the below page as shown in Figure.66.

Figure.66 NVIDIA Driver Repository

As I have 3 different GPUs: Tesla K40c is compatible with NVIDIA-Linux-x86_64-410.104, while Quadro P5000 & Quadro K620 are compatible with NVIDIA-Linux-x86_64-418.56. For all 3 GPUs, I am taking 410.104 as the base driver version.
Step-3: Build Essential Dependencies
1. build-essentials – For building drivers


2. dkms – For providing dkms support, DKMS is for packages that provide a kernel
module in source form (or binary with a source wrapper), so they don’t have to update
the module for every kernel rebuild.
3. gcc-multilib – For providing 32-bit support
4. xorg and xorg-dev – For graphic display on a workstation with GUI (If not installed)
Check with command: $ sudo X -version

Figure.67 Graphics Display
Please run the command: $ sudo apt-get install build-essential gcc-multilib dkms
Step-4: Disable default nouveau
Please note that manual removal of the nouveau drivers is required only if you are going
to install the proprietary NVIDIA drivers; if it is not removed before the NVIDIA driver
installation, nouveau may cause blurry screens. As we have NVIDIA GPUs, we need to
remove it before installing the NVIDIA drivers.

1. Please create a file. Please follow the command below.
$ sudo gedit /etc/modprobe.d/blacklist-nouveau.conf
2. Please add below contents in it

blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
Please verify the file with contents by below command
cat /etc/modprobe.d/blacklist-nouveau.conf

Step-5: Update the initramfs
The initramfs needs to be updated, since it might be configured to load the nouveau drivers.
The update-initramfs script manages your initramfs images on your local box. It keeps
track of the existing initramfs archives in /boot. There are three modes of operation
create, update or delete. You must at least specify one of those modes.
Please run the command below.
$ sudo update-initramfs -u
It will give confirmation with below line.
update-initramfs: Generating /boot/initrd.img-4.18.0-15-generic
Please reboot the machine to proceed further.
Step-6: Stop Desktop Manager


After the computer is rebooted, we need to stop the desktop manager before executing the
runfile to install the driver. lightdm is the default manager in Ubuntu; if the GNOME or
KDE desktop environment is used, the desktop manager would be gdm or kdm.
To find the running session on your machine, please use the command below.
$ printf 'Desktop: %s\nSession: %s\n' "$XDG_CURRENT_DESKTOP" "$GDMSESSION"

Figure.68 GDM Session
Please run the command to stop gdm service.
$ sudo service gdm stop
In order to install the new NVIDIA driver we need to stop the current display server. The
easiest way to do this is to change into runlevel 3 using the telinit command. After this
command the display server will stop, so make sure to save all current work before
proceeding.
Please run the command below.
$ sudo telinit 3
Step-6: Install the driver
cd $HOME
sudo chmod +x NVIDIA-Linux-x86_64-410.104.run
sudo ./NVIDIA-Linux-x86_64-410.104.run --dkms -s
Step-7: Check Installation by using below command.

$ nvidia-smi
As shown in Figure.69, after successful installation it will report all CUDA-capable
devices in your system.

Figure.69 NVIDIA Driver Successful Installation Snapshot

Step -2: (Alternative of above with PPA)
1. Add the Official NVIDIA PPA to Ubuntu and update it.
$ sudo add-apt-repository ppa:graphics-drivers/ppa
$ sudo apt update

2. Please check with below command which driver is required to install.
$ ubuntu-drivers devices

Figure.70 Ubuntu Driver Display
In Figure. 70, the tool clearly recommends nvidia-driver-418, but for a hassle-free
environment we have installed 410.
3. Install the recommended NVIDIA Driver.
$ sudo apt install nvidia-driver-410
Step-3: Install CUDA Toolkit
Pre-Installation Actions
1. Please verify whether you have a CUDA capable GPU.
$ lspci | grep -i nvidia
2. Please verify whether you have a supported version of Linux.
$ uname -m && cat /etc/*release

3. Please verify the system has gcc installed.
$ gcc --version
4. Please verify if the system has correct kernel header installed
$ uname -r
5. Please run the command to install updated kernel header.
$ sudo apt-get install linux-headers-$(uname -r)
6. Please select below link to download the CUDA as in Figure.71.
https://developer.nvidia.com/cuda-downloads

Figure. 71 CUDA Toolkit
7. Install repository meta-data
$ sudo dpkg -i cuda-repo-__.deb
8. Installing the CUDA public GPG key (Installing the local repo)
$ sudo apt-key add /var/cuda-repo-/7fa2af80.pub
9. Update the Apt repository cache

$ sudo apt-get update
10. Install CUDA
$ sudo apt-get install cuda
11. Set Environment path (Post Installation)
1. Take backup of existing bashrc file.
2. Go to the home directory.
cd $HOME
3. Open the .bashrc file
sudo gedit .bashrc
4. Add following two commands in .bashrc file.
export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
5. Save and close the .bashrc file .
6. Restart the machine.
Verification Action already mentioned in the Implementation section.
Step-4: Install cuDNN
1. Go to the cuDNN download page (need registration) and select the latest cuDNN 7.5
version made for CUDA 10.0.
Please use the link below.
https://developer.nvidia.com/rdp/cudnn-download

2. Download all 3 .deb files: the runtime library, the developer library, and the code
samples library for Ubuntu 18.04.
3. Install them in the same order:
sudo dpkg -i libcudnn7_7.5.0.56-1+cuda10.0_amd64.deb (the runtime library)
sudo dpkg -i libcudnn7-dev_7.5.0.56-1+cuda10.0_amd64.deb (the developer library)
sudo dpkg -i libcudnn7-doc_7.5.0.56-1+cuda10.0_amd64.deb (the code samples)
4. The verification process is mentioned in the Implementation section.
Step-5: Install libcupti-dev
1. Please use the below command.
sudo apt-get install libcupti-dev
2. Please add the below line in the bashrc file for environment setup. Use below
command. (Please take a backup of bashrc file)
echo 'export
LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH'
>> ~/.bashrc
Please follow the given link for more details.
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html


A.2 PROGRAM MACHINE SETUP

The HAR program contains multiple files. The important files are
1. lstm_architecture.py
2. Config_Dataset_HAR.py
3. Config_Dataset_HAR.ipynb (Jupyter Notebook)
4. download_datasets.py
The "data" directory needs to be created manually with 775 access inside the
environment; it is referenced in Config_Dataset_HAR as the path for the training and
testing data samples of HAR. The folder structure mentioned in Figure. 8 will be
automatically established by the Config file once it finds the data folder. The
download_datasets.py script loads the UCI repository file from the website the first time
and puts it in the data directory. This file is placed inside the data folder.
The first section of the Config_Dataset_HAR.ipynb file sets the path of the data, calls the
download_datasets.py script to load the data and creates the necessary directory structure
for the program. It needs to run only once for the whole program.
The second section of the Config_Dataset_HAR.ipynb file, on running, creates
X_train_signals_paths and X_test_signal_paths with the proper folder structure.
The load_X and load_y methods take signal paths as input parameters and return
ndarrays, which are the tensors of features for both training and testing. Basically they
prepare the datasets for training and testing by the deep learning model.
The file lstm_architecture.py contains all the different types of LSTM functions, which
are fed by the dataset from the Config_Dataset_HAR.py file with a window of 128
timesteps. The input of HAR should be a time series, and the basic structure of the
LSTM guarantees that it can preserve the characteristics along the temporal dimension.
The input parameters below control the different features; a configuration sketch with
these hyper-parameters follows the list.
self.training_epochs is the number of iterations the model will run.
self.learning_rate is the parameter which decides the learning rate of the model.
self.n_hidden is the parameter which decides how many hidden layers will be developed
by the model for the experiment.
self.use_bidirectionnal_cells is the parameter which decides whether the cells will do
bidirectional communication or not.
n_layers_in_highway is the parameter that decides how many residual layers there will
be in the model.
n_stacked_layers is the parameter that decides how many deep-stacked layers there will
be in the model.
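A minimal sketch of how these hyper-parameters can be grouped; the concrete values are illustrative assumptions, while the actual defaults live in Config_Dataset_HAR.py:

class Config(object):
    """Illustrative hyper-parameter container for the HAR LSTM experiments."""

    def __init__(self):
        self.training_epochs = 250            # number of training iterations (illustrative value)
        self.learning_rate = 0.001            # best learning rate found in Section 6.1
        self.n_hidden = 32                    # hidden units per layer
        self.use_bidirectionnal_cells = True  # bidirectional communication on/off (spelling as in the code)
        self.n_layers_in_highway = 3          # number of residual layers
        self.n_stacked_layers = 3             # number of deep-stacked layers
        self.keep_prob_for_dropout = 0.85     # dropout keep probability (illustrative value)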
OneHotEncoder
A one-hot encoding is a representation of categorical variables as binary vectors. This
first requires that the categorical values be mapped to integer values. Then, each integer
value is represented as a binary vector that is all zeros except at the index of the integer,
which is marked with a 1. We have not used any API for this; we have used a manual
process. The function one_hot(y) converts labels from dense to one-hot form. For
example, it takes [[5], [0], [3]] as the input array and returns [[0, 0, 0, 0, 0, 1], [1, 0, 0, 0,
0, 0], [0, 0, 0, 1, 0, 0]] as output.

L2 Regularization
We have used L2 regularization in the context of Stochastic Gradient Descent in Neural
Network.

Figure.72 An L2-regularized version of the cost function used in SGD of RNN
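In the standard weight-decay form assumed here (with \lambda denoting the L2 regularization coefficient and \theta the trainable weights), the regularized cost minimized by SGD is:

J_{reg}(\theta; X, y) = J(\theta; X, y) + \frac{\lambda}{2} \sum_{w \in \theta} \lVert w \rVert_2^2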
Generally in machine learning, when we fit our model we search the solution space for
the most fitting solution; in the context of neural networks, the solution space can be
thought of as the space of all functions our network can represent. We know that the size
of this space depends on the depth of the network and the activation functions used. We
also know that with at least one hidden layer followed by an activation layer using a
"squashing" function, this space is very large, and that it grows exponentially with the
depth of the network (e.g. the universal approximation theorem).
When we use Stochastic Gradient Descent (SGD) to fit our network's parameters to the
learning problem at hand, we take, at each iteration of the algorithm, a step in the
solution space towards the gradient of the loss function J(θ; X, y) with respect to the
network's parameters θ. Since the solution space of deep neural networks is very rich,
this method of learning might overfit to our training data. This overfitting may result in
significant generalization error and bad performance on test data, in the context of model
development, if no counter-measure is used. Those counter-measures are called
regularization techniques.
Additionally, a large network can be optimized correctly for a problem with sufficient
regularization, such as L2 weight decay and dropout. However, if no regularization is
used, the results tend to overfit and perform badly on the test set. Complexity is good,
but only if countered with regularization. Too many layers and cells per layer will
increase the computational complexity and waste computational resources. When the
layer number and cell number reach a certain scale, the recognition accuracy remains at a
certain level instead of increasing. By adding more depth, regularization is then needed
to avoid overfitting while still improving accuracy. The L2 norm of the weights, used for
weight decay, is added to the loss function in our deep learning model.
Our deep LSTM neural network is limited in terms of how many data points it can
access: it has access to only 128 time steps when making its predictions. Especially when
deepened, the next forward/backward duo will see output from the other pass “in
advance”, because, logically, a backward pass for our bidirectional LSTM reverses the
input and the output before the concatenation. Thus, the Bidir-LSTM has the same input
and output shape as the baseline LSTM. But at a given time step, it has access to more
information in advance because of the backward passes.
Activation Function

In our network, the activation function is unified to ReLU, because it consistently
outperforms the alternatives in deep networks and counters gradient vanishing. It is
recommended to use ReLU or leaky ReLU as the activation function, as it is relatively
robust to the vanishing/exploding gradient issue (especially for networks that are not too
deep). Although the output is a tensor for a given time window, the time axis has been
crunched by the neural network; that is, we need only the last element of the output and
can discard the others. Thus, only the gradient from the prediction at the last time step is
applied. This also makes one LSTM cell unnecessary: the uppermost backward LSTM in
the bidirectional pass. Hopefully, this is not of great concern, because TensorFlow should
evaluate what to compute and what not to compute. Additionally, the training dataset
should be shuffled during the training process. The state of the neural network is reset at
each new window for each new prediction. In our experiment, the 3 x 3 residual
bidirectional LSTM outperforms the other LSTM models with 2 x 2 and 4 x 4
architectures; the 3 x 3 model can be thought of as 18 LSTM cells working in a network.

Adam Optimizer
Adam is an adaptive learning rate optimization algorithm that’s been designed
specifically for training deep neural networks. First published in 2014, Adam was
presented at ICLR 2015 conference for deep learning practitioners. Adam is an adaptive
learning rate method, which means, it computes individual learning rates for different
parameters. Its name is derived from adaptive moment estimation, and the reason it’s
called that is because Adam uses estimations of first and second moments of gradient to
adapt the learning rate for each weight of the neural network.
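A minimal sketch of attaching such an optimizer to a loss in TensorFlow 1.x; the loss below is a toy stand-in, while the real model minimizes the L2-regularized cross-entropy:

import tensorflow as tf

w = tf.Variable(0.5)                                       # toy parameter
loss = tf.square(w - 2.0)                                  # stand-in loss for illustration only
optimizer = tf.train.AdamOptimizer(learning_rate=0.001)    # learning rate selected in Section 6.1
train_op = optimizer.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):
        sess.run(train_op)                                 # Adam adapts the effective rate per parameter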
Dropout
self.keep_prob_for_dropout is the parameter which specifies the dropout in the model.
Dropout is applied between each layer on the depth axis or, sometimes, just at the output,
depending on what is specified in the configuration file; it is another hyper-parameter.
Dropout refers to the fact that parts of the tensors output by the hidden layer are shut
down to a zero value with a certain probability for each value in each training epoch,
while the other values scale up accordingly to keep the same geometric norm of the
tensor's values. The inoperative nodes can be regarded as dead nodes (or neurons) that
are temporarily not in the network, which means that the weights and biases behind these
dead nodes temporarily neither learn nor contribute to the predictions during that training
step for a batch. The weights themselves are kept intact.
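A minimal sketch of applying dropout with this keep probability in TensorFlow 1.x; hidden_out is an illustrative stand-in for a hidden layer's output tensor:

import tensorflow as tf

hidden_out = tf.placeholder(tf.float32, shape=[None, 32], name="hidden_out")   # stand-in hidden output
keep_prob = tf.placeholder(tf.float32, name="keep_prob_for_dropout")           # fed from the config
# During training keep_prob is e.g. 0.85; at test time it is fed as 1.0 so no units are dropped.
hidden_out_dropped = tf.nn.dropout(hidden_out, keep_prob)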


APPENDIX B – SOURCE CODE

B.1 TensorFlow Code
download_dataset.py
# !wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00240/UCI HAR Dataset.zip"
# !wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00240/UCI HAR Dataset.names"
# import copy
import os
from subprocess import call

print("")
print("Downloading UCI HAR Dataset...")
if not os.path.exists("UCI HAR Dataset.zip"):
    call(
        'wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00240/UCI HAR Dataset.zip"',
        shell=True
    )
    print("Downloading done.\n")
else:
    print("Dataset already downloaded. Did not download twice.\n")

print("Extracting...")
extract_directory = os.path.abspath("UCI HAR Dataset")
if not os.path.exists(extract_directory):
    call(
        'unzip -nq "UCI HAR Dataset.zip"',
        shell=True
    )
    print("Extracting successfully done to {}.".format(extract_directory))
else:
    print("Dataset already extracted. Did not extract twice.\n")
lstm_architecture.py
__author__ = 'jk_ranbir'

import tensorflow as tf
from sklearn import metrics
from sklearn.utils import shuffle
import numpy as np
from datetime import datetime
import time


def one_hot(y):
    """convert label from dense to one hot
    argument:
        label: ndarray dense label, shape: [sample_num, 1]
    return:
        one_hot_label: ndarray one hot, shape: [sample_num, n_class]
    """
    # e.g.: [[5], [0], [3]] --> [[0, 0, 0, 0, 0, 1], [1, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0]]
    y = y.reshape(len(y))
    n_values = np.max(y) + 1
    return np.eye(n_values)[np.array(y, dtype=np.int32)]  # Returns FLOATS

def batch_norm(input_tensor, config, i):
    # Implementing batch normalisation: this is used out of the residual layers
    # to normalise those output neurons by mean and standard deviation.
    if config.n_layers_in_highway == 0:
        # There is no residual layers, no need for batch_norm:
        return input_tensor
    with tf.variable_scope("batch_norm") as scope:
        if i != 0:
            # Do not create extra variables for each time step
            scope.reuse_variables()
        # Mean and variance normalisation simply crunched over all axes
        axes = list(range(len(input_tensor.get_shape())))
        mean, variance = tf.nn.moments(input_tensor, axes=axes, shift=None, name=None,
                                       keep_dims=False)
        stdev = tf.sqrt(variance + 0.001)
        # Rescaling
        bn = input_tensor - mean
        bn /= stdev
        # Learnable extra rescaling
        # tf.get_variable("relu_fc_weights", initializer=tf.random_normal(mean=0.0, stddev=0.0)
        bn *= tf.get_variable("a_noreg", initializer=tf.random_normal([1], mean=0.5,
                                                                      stddev=0.0))
        bn += tf.get_variable("b_noreg", initializer=tf.random_normal([1], mean=0.0,
                                                                      stddev=0.0))
        # bn *= tf.Variable(0.5, name=(scope.name + "/a_noreg"))
        # bn += tf.Variable(0.0, name=(scope.name + "/b_noreg"))
        return bn
def relu_fc(input_2D_tensor_list, features_len, new_features_len, config):
"""make a relu fully-connected layer, mainly change the shape of tensor
both input and output is a list of tensor
argument:
input_2D_tensor_list: list shape is [batch_size,feature_num]
features_len: int the initial features length of input_2D_tensor
new_feature_len: int the final features length of output_2D_tensor
config: Config used for weights initializers
return:
146

output_2D_tensor_list lit shape is [batch_size,new_feature_len]
"""
W = tf.get_variable(
"relu_fc_weights",
initializer=tf.random_normal(
[features_len, new_features_len],
mean=0.0,
stddev=float(config.weights_stddev)
)
)
b = tf.get_variable(
"relu_fc_biases_noreg",
initializer=tf.random_normal(
[new_features_len],
mean=float(config.bias_mean),
stddev=float(config.weights_stddev)
)
)
# intra-timestep multiplication:
output_2D_tensor_list = [
tf.nn.relu(tf.matmul(input_2D_tensor, W) + b)
for input_2D_tensor in input_2D_tensor_list
]
return output_2D_tensor_list

def single_LSTM_cell(input_hidden_tensor, n_outputs):
""" define the basic LSTM layer
argument:
input_hidden_tensor: list a list of tensor,
shape: time_steps*[batch_size,n_inputs]
n_outputs: int num of LSTM layer output
return:
outputs: list a time_steps list of tensor,
shape: time_steps*[batch_size,n_outputs]
"""
with tf.variable_scope("lstm_cell"):
lstm_cell = tf.nn.rnn_cell.LSTMCell(n_outputs, state_is_tuple=True,
forget_bias=0.999)
outputs, _ = tf.nn.static_rnn(lstm_cell, input_hidden_tensor, dtype=tf.float32)
return outputs

def bi_LSTM_cell(input_hidden_tensor, n_inputs, n_outputs, config):
"""build bi-LSTM, concatenating the two directions in an inner manner.
argument:
input_hidden_tensor: list a time_steps series of tensor, shape: [sample_num,
n_inputs]
n_inputs: int units of input tensor
n_outputs: int units of output tensor, each bi-LSTM will have half those internal
units
config: Config used for the relu_fc
return:

layer_hidden_outputs: list a time_steps series of tensor, shape: [sample_num,
n_outputs]
"""
n_outputs = int(n_outputs/2)
print ("bidir:")
with tf.variable_scope('pass_forward') as scope2:
hidden_forward = relu_fc(input_hidden_tensor, n_inputs, n_outputs, config)
forward = single_LSTM_cell(hidden_forward, n_outputs)
print (len(hidden_forward), str(hidden_forward[0].get_shape()))
# Backward pass is as simple as surrounding the cell with a double inversion:
with tf.variable_scope('pass_backward') as scope2:
hidden_backward = relu_fc(input_hidden_tensor, n_inputs, n_outputs, config)
backward = list(reversed(single_LSTM_cell(list(reversed(hidden_backward)),
n_outputs)))
with tf.variable_scope('bidir_concat') as scope:
# Simply concatenating cells' outputs at each timesteps on the innermost
# dimension, like if the two cells acted as one cell
# with twice the n_hidden size:
layer_hidden_outputs = [
tf.concat([f, b], len(f.get_shape()) - 1)
for f, b in zip(forward, backward)]
return layer_hidden_outputs

def residual_bidirectional_LSTM_layers(input_hidden_tensor, n_input, n_output,
layer_level, config, keep_prob_for_dropout):
"""This architecture is only enabled if "config.n_layers_in_highway" has a
value only greater than int(0). The arguments are same than for bi_LSTM_cell.
arguments:
input_hidden_tensor: list a time_steps series of tensor, shape: [sample_num,
n_inputs]
n_inputs: int units of input tensor
n_outputs: int units of output tensor, each bi-LSTM will have half those internal
units
config: Config used for determining if there are residual connections and if yes, their
number and with some batch_norm.
return:
layer_hidden_outputs: list a time_steps series of tensor, shape: [sample_num,
n_outputs]
"""
with tf.variable_scope('layer_{}'.format(layer_level)) as scope:
if config.use_bidirectionnal_cells:
get_lstm = lambda input_tensor: bi_LSTM_cell(input_tensor, n_input, n_output,
config)
else:
get_lstm = lambda input_tensor: single_LSTM_cell(relu_fc(input_tensor,
n_input, n_output, config), n_output)
def add_highway_redisual(layer, residual_minilayer):
return [a + b for a, b in zip(layer, residual_minilayer)]

hidden_LSTM_layer = get_lstm(input_hidden_tensor)
# Adding K new (residual bidir) connections to this first layer:
for i in range(config.n_layers_in_highway - 1):
with tf.variable_scope('LSTM_residual_{}'.format(i)) as scope2:
hidden_LSTM_layer = add_highway_redisual(
hidden_LSTM_layer,
get_lstm(input_hidden_tensor)
)
if config.also_add_dropout_between_stacked_cells:
hidden_LSTM_layer = [tf.nn.dropout(out, keep_prob_for_dropout) for out in
hidden_LSTM_layer]
return [batch_norm(out, config, i) for i, out in enumerate(hidden_LSTM_layer)]

def LSTM_network(feature_mat, config, keep_prob_for_dropout):
"""model a LSTM Network,
it stacks 2 LSTM layers, each layer has n_hidden=32 cells
and 1 output layer, it is a full connet layer
argument:
feature_mat: ndarray fature matrix, shape=[batch_size,time_steps,n_inputs]
config: class containing config of network
return:
: ndarray output shape [batch_size, n_classes]
"""

with tf.variable_scope('LSTM_network') as scope: # TensorFlow graph naming
feature_mat = tf.nn.dropout(feature_mat, keep_prob_for_dropout)
# Exchange dim 1 and dim 0
feature_mat = tf.transpose(feature_mat, [1, 0, 2])
print (feature_mat.get_shape())
# New feature_mat's shape: [time_steps, batch_size, n_inputs]
# Temporarily crush the feature_mat's dimensions
feature_mat = tf.reshape(feature_mat, [-1, config.n_inputs])
print (feature_mat.get_shape())
# New feature_mat's shape: [time_steps*batch_size, n_inputs]
# Split the series because the rnn cell needs time_steps features, each of shape:
hidden = tf.split(feature_mat, config.n_steps, 0)
print (len(hidden), str(hidden[0].get_shape()))
# New shape: a list of lenght "time_step" containing tensors of shape [batch_size,
n_hidden]
# Stacking LSTM cells, at least one is stacked:
print ("\nCreating hidden #1:")
hidden = residual_bidirectional_LSTM_layers(hidden, config.n_inputs,
config.n_hidden, 1, config, keep_prob_for_dropout)
print (len(hidden), str(hidden[0].get_shape()))
for stacked_hidden_index in range(config.n_stacked_layers - 1):
# If the config permits it, we stack more lstm cells:

print ("\nCreating hidden #{}:".format(stacked_hidden_index+2))
hidden = residual_bidirectional_LSTM_layers(hidden, config.n_hidden,
config.n_hidden, stacked_hidden_index+2, config, keep_prob_for_dropout)
print (len(hidden), str(hidden[0].get_shape()))
print ("")
# Final fully-connected activation logits
# Get the last output tensor of the inner loop output series, of shape [batch_size,
n_classes]
last_hidden = tf.nn.dropout(hidden[-1], keep_prob_for_dropout)
last_logits = relu_fc(
[last_hidden],
config.n_hidden, config.n_classes, config
)[0]
return last_logits

def run_with_config(Config, X_train, y_train, X_test, y_test):
start_time = datetime.now()
print ("Start Time: ",time.ctime())
print ("")
tf.reset_default_graph() # To enable to run multiple things in a loop
# ----------------------------------
# Define parameters for model
# ----------------------------------
config = Config(X_train, X_test)

print("Some useful info to get an insight on dataset's shape and normalisation:")
print("features shape, labels shape, each features mean, each features standard
deviation")
print(X_test.shape, y_test.shape,
np.mean(X_test), np.std(X_test))
print("the dataset is therefore properly normalised, as expected.")
# -----------------------------------------------------
# Let's get serious and build the neural network
# -----------------------------------------------------
with tf.device("/cpu:0"):  # Remove this line to use GPU. If you have a too small GPU, it crashes.
#with tf.device('/gpu:0'):
#with tf.device('/gpu:1'):
#mirrored_strategy = tf.contrib.distribute.MirroredStrategy(devices=["/gpu:0",
"/gpu:1"])
#with mirrored_strategy.scope():
X = tf.placeholder(tf.float32, [
None, config.n_steps, config.n_inputs], name="X")
Y = tf.placeholder(tf.float32, [
None, config.n_classes], name="Y")
# is_train for dropout control:
is_train = tf.placeholder(tf.bool, name="is_train")
keep_prob_for_dropout = tf.cond(is_train,
lambda: tf.constant(
config.keep_prob_for_dropout,
name="keep_prob_for_dropout"

),
lambda: tf.constant(
1.0,
name="keep_prob_for_dropout"
)
)
pred_y = LSTM_network(X, config, keep_prob_for_dropout)
# Loss, optimizer, evaluation
# Softmax loss with L2 and L1 layer-wise regularisation
print ("Unregularised variables:")
for unreg in [tf_var.name for tf_var in tf.trainable_variables() if ("noreg" in
tf_var.name or "Bias" in tf_var.name)]:
print (unreg)
l2 = config.lambda_loss_amount * sum(
tf.nn.l2_loss(tf_var)
for tf_var in tf.trainable_variables()
if not ("noreg" in tf_var.name or "Bias" in tf_var.name)
)
# first_weights = [w for w in tf.all_variables() if w.name ==
'LSTM_network/layer_1/pass_forward/relu_fc_weights:0'][0]
# l1 = config.lambda_loss_amount * tf.reduce_mean(tf.abs(first_weights))
loss = tf.reduce_mean(
tf.nn.softmax_cross_entropy_with_logits_v2(logits=pred_y,labels=Y)) + l2 # +
l1

# Gradient clipping Adam optimizer with gradient noise
optimize = tf.contrib.layers.optimize_loss(
loss,
global_step=tf.Variable(0),
learning_rate=config.learning_rate,
optimizer=tf.train.AdamOptimizer(learning_rate=config.learning_rate),
clip_gradients=config.clip_gradients,
gradient_noise_scale=config.gradient_noise_scale
)
correct_pred = tf.equal(tf.argmax(pred_y, 1), tf.argmax(Y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, dtype=tf.float32))
# -------------------------------------------
# Hooray, now train the neural network
# -------------------------------------------
# Note that log_device_placement can be turned off for less console spam.
#sessconfig = tf.ConfigProto(log_device_placement=False)
sessconfig = tf.ConfigProto(allow_soft_placement =
True,log_device_placement=False)
#sessconfig.gpu_options.allow_growth = True
with tf.Session(config=sessconfig) as sess:
#init = tf.global_variables_initializer()
sess.run(tf.global_variables_initializer())
best_accuracy = (0.0, "iter: -1")
best_f1_score = (0.0, "iter: -1")

# Start training for each batch and loop epochs
worst_batches = []
for i in range(config.training_epochs):
# Loop batches for an epoch:
shuffled_X, shuffled_y = shuffle(X_train, y_train, random_state=i*42)
for start, end in zip(range(0, config.train_count, config.batch_size),
range(config.batch_size, config.train_count + 1, config.batch_size)):
_, train_acc, train_loss, train_pred = sess.run(
[optimize, accuracy, loss, pred_y],
feed_dict={
X: shuffled_X[start:end],
Y: shuffled_y[start:end],
is_train: True
}
)
worst_batches.append(
(train_loss, shuffled_X[start:end], shuffled_y[start:end])
)
worst_batches = list(sorted(worst_batches))[-5:] # Keep 5 poorest
# Train F1 score is not on boosting
train_f1_score = metrics.f1_score(

shuffled_y[start:end].argmax(1), train_pred.argmax(1), average="weighted"
)
# Retrain on top worst batches of this epoch (boosting):
# a.k.a. "focus on the hardest exercises while training":
for _, x_, y_ in worst_batches:
_, train_acc, train_loss, train_pred = sess.run(
[optimize, accuracy, loss, pred_y],
feed_dict={
X: x_,
Y: y_,
is_train: True
}
)
# Test completely at the end of every epoch:
# Calculate accuracy and F1 score
pred_out, accuracy_out, loss_out = sess.run(
[pred_y, accuracy, loss],
feed_dict={
X: X_test,
Y: y_test,
is_train: False
}
)
# "y_test.argmax(1)": could be optimised by being computed once...

f1_score_out = metrics.f1_score(
y_test.argmax(1), pred_out.argmax(1), average="weighted"
)
print (
"iter: {}, ".format(i) + \
"train loss: {}, ".format(train_loss) + \
"train accuracy: {}, ".format(train_acc) + \
"train F1-score: {}, ".format(train_f1_score) + \
"test loss: {}, ".format(loss_out) + \
"prediction accuracy: {}, ".format(accuracy_out) + \
"test F1-score: {}".format(f1_score_out)
)
best_accuracy = max(best_accuracy, (accuracy_out, "iter: {}".format(i)))
best_f1_score = max(best_f1_score, (f1_score_out, "iter: {}".format(i)))
print("")
print("final prediction accuracy: {}".format(accuracy_out))
print("best epoch's prediction accuracy: {}".format(best_accuracy))
print("final F1 score: {}".format(f1_score_out))
print("best epoch's F1 score: {}".format(best_f1_score))
print("")
end_time = datetime.now()
print("End Time: ",time.ctime())
print("Exec Duration: {}".format(end_time - start_time))
print("")

# returning both final and bests accuracies and f1 scores.
return accuracy_out, best_accuracy, f1_score_out, best_f1_score

Config_Dataset_HAR.py
#!/usr/bin/env python
# coding: utf-8
# In[ ]:
__author__ = 'jkranbir'
# Note: Linux bash commands start with a "!" inside those "ipython notebook" cells
import os
DATA_PATH = "data/"
get_ipython().system('pwd && ls')
os.chdir(DATA_PATH)
get_ipython().system('pwd && ls')
get_ipython().system('python download_datasets.py')
get_ipython().system('pwd && ls')
os.chdir("..")
get_ipython().system('pwd && ls')
DATASET_PATH = DATA_PATH + "UCI HAR Dataset/"
print("\n" + "Dataset is now located at: " + DATASET_PATH)
# In[ ]:
__author__ = 'jkranbir'

from lstm_architecture import one_hot, run_with_config
import numpy as np
import os
#os.environ["CUDA_VISIBLE_DEVICES"]="0,1"
# -------------------------------------------
# Neural net's config.
# -------------------------------------------
class Config(object):
"""
define a class to store parameters,
the input should be feature mat of training and testing
"""
def __init__(self, X_train, X_test):
# Data shaping
self.train_count = len(X_train) # 7352 training series
self.test_data_count = len(X_test) # 2947 testing series
self.n_steps = len(X_train[0]) # 128 time_steps per series
self.n_classes = 6 # Final output classes
# Training
self.learning_rate = 0.001
self.lambda_loss_amount = 0.005
self.training_epochs = 250 #5
self.batch_size = 100
self.clip_gradients = 15.0

self.gradient_noise_scale = None
# Dropout is added on inputs and after each stacked layers (but not
# between residual layers).
self.keep_prob_for_dropout = 0.85 # **(1/3.0)
# Linear+relu structure
self.bias_mean = 0.3
# I would recommend between 0.1 and 1.0 or to change and use a xavier
# initializer
self.weights_stddev = 0.2
########
# NOTE: I think that if any of the below parameters are changed,
# the best is to readjust every parameters in the "Training" section
# above to properly compare the architectures only once optimised.
########
# LSTM structure
# Features count is of 9: three 3D sensors features over time
self.n_inputs = len(X_train[0][0])
self.n_hidden = 256 # nb of neurons inside the neural network
# Use bidir in every LSTM cell, or not:
self.use_bidirectionnal_cells = True #False
# High-level deep architecture
self.also_add_dropout_between_stacked_cells = False #True
# NOTE: values of exactly 1 (int) for those 2 high-level parameters below totally
disables them and result in only 1 starting LSTM.

# self.n_layers_in_highway = 1 # Number of residual connections to the LSTMs
(highway-style), this is did for each stacked block (inside them).
# self.n_stacked_layers = 1 # Stack multiple blocks of residual
# layers.
# -------------------------------------------
# Dataset-specific constants and functions + loading
# -------------------------------------------
# Useful Constants
# Those are separate normalised input features for the neural network
INPUT_SIGNAL_TYPES = [
"body_acc_x_",
"body_acc_y_",
"body_acc_z_",
"body_gyro_x_",
"body_gyro_y_",
"body_gyro_z_",
"total_acc_x_",
"total_acc_y_",
"total_acc_z_"
]
# Output classes to learn how to classify
LABELS = [
"WALKING",
"WALKING_UPSTAIRS",

"WALKING_DOWNSTAIRS",
"SITTING",
"STANDING",
"LAYING"
]
DATA_PATH = "data/"
DATASET_PATH = DATA_PATH + "UCI HAR Dataset/"
TRAIN = "train/"
TEST = "test/"

# Load "X" (the neural network's training and testing inputs)
def load_X(X_signals_paths):
"""
Given attribute (train or test) of feature, read all 9 features into an
np ndarray of shape [sample_sequence_idx, time_step, feature_num]
argument: X_signals_paths str attribute of feature: 'train' or 'test'
return:

np ndarray, tensor of features

"""
X_signals = []
for signal_type_path in X_signals_paths:
file = open(signal_type_path, 'r')
# Read dataset from disk, dealing with text files' syntax
X_signals.append(

[np.array(serie, dtype=np.float32) for serie in [
row.replace('  ', ' ').strip().split(' ') for row in file
]]
)
file.close()
return np.transpose(np.array(X_signals), (1, 2, 0))
X_train_signals_paths = [
DATASET_PATH + TRAIN + "Inertial Signals/" + signal + "train.txt" for signal in
INPUT_SIGNAL_TYPES
]
X_test_signals_paths = [
DATASET_PATH + TEST + "Inertial Signals/" + signal + "test.txt" for signal in
INPUT_SIGNAL_TYPES
]
X_train = load_X(X_train_signals_paths)
X_test = load_X(X_test_signals_paths)
# Load "y" (the neural network's training and testing outputs)
def load_y(y_path):
"""
Read Y file of values to be predicted
argument: y_path str attibute of Y: 'train' or 'test'
return: Y ndarray / tensor of the 6 one_hot labels of each sample
"""
file = open(y_path, 'r')
# Read dataset from disk, dealing with text file's syntax
y_ = np.array(
[elem for elem in [

row.replace('  ', ' ').strip().split(' ') for row in file
]],
dtype=np.int32
)
file.close()
# Substract 1 to each output class for friendly 0-based indexing
return one_hot(y_ - 1)
y_train_path = DATASET_PATH + TRAIN + "y_train.txt"
y_test_path = DATASET_PATH + TEST + "y_test.txt"
y_train = load_y(y_train_path)
y_test = load_y(y_test_path)
# -------------------------------------------
# Training (maybe multiple) experiment(s)
# -------------------------------------------
n_layers_in_highway = 4
n_stacked_layers = 4
trial_name = "{}x{}".format(n_layers_in_highway, n_stacked_layers)
for learning_rate in [0.001]: # [0.01, 0.001, 0.0001]:
for lambda_loss_amount in [0.005]:
for clip_gradients in [15.0]:
print ("learning_rate: {}".format(learning_rate))
print ("lambda_loss_amount: {}".format(lambda_loss_amount))
print ("")

class EditedConfig(Config):
def __init__(self, X, Y):
super(EditedConfig, self).__init__(X, Y)
# Edit only some parameters:
self.learning_rate = learning_rate
self.lambda_loss_amount = lambda_loss_amount
self.clip_gradients = clip_gradients
# Architecture params:
self.n_layers_in_highway = n_layers_in_highway
self.n_stacked_layers = n_stacked_layers
# # Useful catch upon looping (e.g.: not enough memory)
# try:
#     accuracy_out, best_accuracy = run_with_config(EditedConfig)
# except:
#     accuracy_out, best_accuracy = -1, -1

accuracy_out, best_accuracy, f1_score_out, best_f1_score = (
run_with_config(EditedConfig, X_train, y_train, X_test, y_test)
)
print (accuracy_out, best_accuracy, f1_score_out, best_f1_score)
with open('{}_result_HAR_6.txt'.format(trial_name), 'a') as f:
f.write(str(learning_rate) + ' \t' + str(lambda_loss_amount) + ' \t' +
str(clip_gradients) + ' \t' + str(
accuracy_out) + ' \t' + str(best_accuracy) + ' \t' + str(f1_score_out) + ' \t' +
str(best_f1_score) + '\n\n')

print ("________________________________________________________")
print ("")
print ("Done.")

# In[ ]:

B.2 PyTorch Code
Script.py
__author__ = 'jkranbir'
# Note: Linux bash commands start with a "!" inside those "ipython notebook" cells
import os
DATA_PATH = "data/"
!pwd && ls
os.chdir(DATA_PATH)
!pwd && ls
!python download_datasets.py
!pwd && ls
os.chdir("..")
!pwd && ls

DATASET_PATH = DATA_PATH + "UCI HAR Dataset/"
print("\n" + "Dataset is now located at: " + DATASET_PATH)

network_1.py
# encoding=utf-8
"""
Created on 12:48 2019/03/10
@author: Jagadish Kumar Ranbirsingh
"""
import torch.nn as nn
import torch.nn.functional as F
class Network(nn.Module):
def __init__(self):
super(Network, self).__init__()
self.conv1 = nn.Sequential(
nn.Conv2d(in_channels=9, out_channels=32, kernel_size=(1, 9)),
# nn.BatchNorm1d()
nn.ReLU(),
nn.MaxPool2d(kernel_size=(1, 2), stride=2)
)
self.conv2 = nn.Sequential(

nn.Conv2d(in_channels=32, out_channels=64, kernel_size=(1, 9)),
nn.ReLU(),
nn.MaxPool2d(kernel_size=(1, 2), stride=2)
)
self.fc1 = nn.Sequential(
nn.Linear(in_features=64 * 26, out_features=1000),
nn.ReLU()
)
self.fc2 = nn.Sequential(
nn.Linear(in_features=1000, out_features=500),
nn.ReLU()
)
self.fc3 = nn.Sequential(
nn.Linear(in_features=500, out_features=6)
)
def forward(self, x):
out = self.conv1(x)
out = self.conv2(out)
out = out.reshape(-1, 64 * 26)
out = self.fc1(out)
out = self.fc2(out)
out = self.fc3(out)
out = F.softmax(out, dim=1)
return out
network.py

# encoding=utf-8
"""
Created on 12:48 2019/03/10
@author: Jagadish Kumar Ranbirsingh
"""
import torch.nn as nn
import torch.nn.functional as F

class CNN(nn.Module):
def __init__(self):
super(CNN, self).__init__()
self.conv1 = nn.Sequential(
nn.Conv2d(in_channels=9, out_channels=32, kernel_size=(1, 9)),
# nn.BatchNorm1d()
nn.ReLU(),
nn.MaxPool2d(kernel_size=(1, 2), stride=2)
)
self.conv2 = nn.Sequential(
nn.Conv2d(in_channels=32, out_channels=64, kernel_size=(1, 9)),
nn.ReLU(),
nn.MaxPool2d(kernel_size=(1, 2), stride=2)
)
self.conv2_drop = nn.Dropout2d()
self.fc1 = nn.Sequential(
nn.Linear(in_features=64 * 26, out_features=1000),
nn.ReLU()

)
self.fc2 = nn.Sequential(
nn.Linear(in_features=1000, out_features=500),
nn.ReLU()
)
self.fc3 = nn.Sequential(
nn.Linear(in_features=500, out_features=6)
)
def forward(self, x):
out = self.conv1(x)
out = self.conv2_drop(self.conv2(out))
out = out.view(-1, 64 * 26)
out = self.fc1(out)
out = self.fc2(out)
out = self.fc3(out)
return out
class Network(nn.Module):
def __init__(self):
super(Network, self).__init__()
self.cnn = CNN()
self.rnn = nn.LSTM(64 * 26, 6, 2)
def forward(self, x):
out = self.cnn(x)
out = self.rnn(out)
out = F.softmax(out, dim=1)
return out

data_preprocess.py
# encoding=utf-8
"""
Created on 07:51 2019/03/10
@author: Jagadish Kumar Ranbirsingh
"""
import numpy as np
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
# This is for parsing the X data, you can ignore it if you do not need preprocessing
def format_data_x(datafile):
x_data = None
for item in datafile:
item_data = np.loadtxt(item, dtype=np.float)
if x_data is None:
x_data = np.zeros((len(item_data), 1))
x_data = np.hstack((x_data, item_data))
x_data = x_data[:, 1:]
print(x_data.shape)
X = None
for i in range(len(x_data)):
row = np.asarray(x_data[i, :])
row = row.reshape(9, 128).T
if X is None:
X = np.zeros((len(x_data), 128, 9))

X[i] = row
print(X.shape)
return X
# This is for parsing the Y data, you can ignore it if you do not need preprocessing
def format_data_y(datafile):
data = np.loadtxt(datafile, dtype=np.int) - 1
YY = np.eye(6)[data]
return YY
# This for processing the dataset from scratch
# After script downloading the dataset, program put it in the DATA_PATH folder
def load_data():
DATA_PATH = 'data/'
DATASET_PATH = DATA_PATH + 'UCI HAR Dataset/'
TRAIN = 'train/'
TEST = 'test/'
INPUT_SIGNAL_TYPES = [
"body_acc_x_",
"body_acc_y_",
"body_acc_z_",
"body_gyro_x_",
"body_gyro_y_",
"body_gyro_z_",
"total_acc_x_",
"total_acc_y_",

"total_acc_z_"
]
str_train_files = [DATASET_PATH + TRAIN + 'Inertial Signals/' + item + 'train.txt'
for item in
INPUT_SIGNAL_TYPES]
str_test_files = [DATASET_PATH + TEST + 'Inertial Signals/' + item + 'test.txt' for
item in INPUT_SIGNAL_TYPES]
str_train_y = DATASET_PATH + TRAIN + 'y_train.txt'
str_test_y = DATASET_PATH + TEST + 'y_test.txt'
X_train = format_data_x(str_train_files)
X_test = format_data_x(str_test_files)
Y_train = format_data_y(str_train_y)
Y_test = format_data_y(str_test_y)
return X_train, onehot_to_label(Y_train), X_test, onehot_to_label(Y_test)
def onehot_to_label(y_onehot):
a = np.argwhere(y_onehot == 1)
return a[:, -1]
class data_loader(Dataset):
def __init__(self, samples, labels, t):
self.samples = samples
self.labels = labels
self.T = t
def __getitem__(self, index):
sample, target = self.samples[index], self.labels[index]
return self.T(sample), target

def __len__(self):
return len(self.samples)
def load(batch_size=100):
x_train, y_train, x_test, y_test = load_data()
x_train, x_test = x_train.reshape((-1, 9, 1, 128)), x_test.reshape((-1, 9, 1, 128))
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize(mean=(0,0,0,0,0,0,0,0,0), std=(1,1,1,1,1,1,1,1,1))
])
train_set = data_loader(x_train, y_train, transform)
test_set = data_loader(x_test, y_test, transform)
train_loader = DataLoader(train_set, batch_size=batch_size, num_workers=8,
pin_memory=True, shuffle=True, drop_last=True)
test_loader = DataLoader(test_set, batch_size=batch_size, num_workers=8,
pin_memory=True, shuffle=False)
return train_loader, test_loader
Config_Dataset_HAR.py
#!/usr/bin/env python
# coding: utf-8
# In[ ]:
__author__ = 'jkranbir'
# Note: Linux bash commands start with a "!" inside those "ipython notebook" cells
import os

DATA_PATH = "data/"
get_ipython().system('pwd && ls')
os.chdir(DATA_PATH)
get_ipython().system('pwd && ls')
get_ipython().system('python download_datasets.py')
get_ipython().system('pwd && ls')
os.chdir("..")
get_ipython().system('pwd && ls')
DATASET_PATH = DATA_PATH + "UCI HAR Dataset/"
print("\n" + "Dataset is now located at: " + DATASET_PATH)
# In[ ]:
# encoding=utf-8
"""
Created on 09:41 2019/03/10
@author: Jagadish Kumar Ranbirsingh
"""
import data_preprocess
import matplotlib.pyplot as plt
import network as net
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import tqdm
from datetime import datetime

import time
BATCH_SIZE = 100 #256
N_EPOCH = 10 * 250 #250 In dataset 7352 training series
LEARNING_RATE = 0.001
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Device :',DEVICE)
print('Device Count',torch.cuda.device_count())
print('Current Device:',torch.cuda.get_device_name(torch.cuda.current_device()))
result = [ ]
def train(model, optimizer, train_loader, test_loader):
n_batch = len(train_loader.dataset) // BATCH_SIZE
print('n_batch',n_batch)
criterion = nn.CrossEntropyLoss()
for e in range(N_EPOCH):
model.train()
correct, total_loss = 0, 0
total = 0
for index, (sample, target) in enumerate(train_loader):
sample, target = sample.to(DEVICE).float(), target.to(DEVICE).long()
sample = sample.view(-1, 9, 1, 128)
output = model(sample)
loss = criterion(output, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()

total_loss += loss.item()
_, predicted = torch.max(output.data, 1)
total += target.size(0)
correct += (predicted == target).sum()
if index % 20 == 0:
tqdm.tqdm.write('Epoch: [{}/{}], Batch: [{}/{}], loss:{:.4f}'.format(e + 1,
N_EPOCH, index + 1, n_batch, loss.item()))
acc_train = float(correct) * 100.0 / (BATCH_SIZE * n_batch)
tqdm.tqdm.write(
'Epoch: [{}/{}], loss: {:.4f}, train acc: {:.2f}%'.format(e + 1, N_EPOCH,
total_loss * 1.0 / n_batch, acc_train))

# Testing
model.train(False)
with torch.no_grad():
correct, total = 0, 0
for sample, target in test_loader:
sample, target = sample.to(DEVICE).float(), target.to(DEVICE).long()
sample = sample.view(-1, 9, 1, 128)
output = model(sample)
_, predicted = torch.max(output.data, 1)
total += target.size(0)
correct += (predicted == target).sum()
acc_test = float(correct) * 100 / total
tqdm.tqdm.write('Epoch: [{}/{}], test acc: {:.2f}%'.format(e + 1, N_EPOCH, float(correct) * 100 / total))
result.append([acc_train, acc_test])
result_np = np.array(result, dtype=float)
np.savetxt('result.csv', result_np, fmt='%.2f', delimiter=',')
def plot():
data = np.loadtxt('result.csv', delimiter=',')
plt.figure()
plt.plot(range(1, len(data[:, 0]) + 1), data[:, 0], color='blue', label='train')
plt.plot(range(1, len(data[:, 1]) + 1), data[:, 1], color='red', label='test')
plt.legend()
plt.xlabel('Epoch', fontsize=14)
plt.ylabel('Accuracy (%)', fontsize=14)
plt.title('Training and Prediction Accuracy', fontsize=20)
if __name__ == '__main__':
torch.cuda.manual_seed_all(10)
start_time = datetime.now()
print ("Start Time: ",time.ctime())
print ("")
train_loader, test_loader = data_preprocess.load(batch_size=BATCH_SIZE)
model = net.Network()
model = model.to(DEVICE)
#torch.distributed.init_process_group(backend="nccl")
#model = torch.nn.parallel.DistributedDataParallel(model)
optimizer = optim.SGD(params=model.parameters(), lr=LEARNING_RATE,
momentum=0.9)
train(model, optimizer, train_loader, test_loader)

result = np.array(result, dtype=float)
np.savetxt('result.csv', result, fmt='%.2f', delimiter=',')
plot()
print("")
end_time = datetime.now()
print("End Time: ",time.ctime())
print("Exec Duration: {}".format(end_time - start_time))
print("")
# In[ ]:

data/download_datasets.py: same as in the TensorFlow code (Appendix B.1).

B.3 Raspberry PI Cluster – Monte Carlo Simulation
server.py
import sys
import tensorflow as tf
import netifaces as ni
def getIpAddr():
    ni.ifaddresses("eth0")
    ip = ni.ifaddresses("eth0")[ni.AF_INET][0]["addr"]
    return ip

taskList = ["192.168.1.16:1024", "192.168.1.17:1024", "192.168.1.18:1024", "192.168.1.19:1024",
            "192.168.1.20:1024", "192.168.1.21:1024", "192.168.1.22:1024", "192.168.1.23:1024",
            "192.168.1.24:1024", "192.168.1.25:1024", "192.168.1.26:1024", "192.168.1.27:1024",
            "192.168.1.28:1024", "192.168.1.29:1024", "192.168.1.30:1024", "192.168.1.31:1024"]
taskName = getIpAddr() + ":1024"
try:
    taskNum = taskList.index(taskName)
except ValueError:
    print("Unable to find " + taskName + " in the task list.")
    quit()
cluster = tf.train.ClusterSpec({"local": taskList})
server = tf.train.Server(cluster, job_name="local", task_index=taskNum)
server.join()
client.py
import tensorflow as tf
import numpy as np
import math
import time

start = time.time()
size = int(1 * math.pow(10, 6))
taskList = ["192.168.1.16:1024", "192.168.1.17:1024", "192.168.1.18:1024", "192.168.1.19:1024",
            "192.168.1.20:1024", "192.168.1.21:1024", "192.168.1.22:1024", "192.168.1.23:1024",
            "192.168.1.24:1024", "192.168.1.25:1024", "192.168.1.26:1024", "192.168.1.27:1024",
            "192.168.1.28:1024", "192.168.1.29:1024", "192.168.1.30:1024", "192.168.1.31:1024"]
taskCount = len(taskList)
n = size // taskCount
r = size % taskCount
cluster = tf.train.ClusterSpec({"local": taskList})
total = tf.Variable(0, dtype=tf.float32)
for i in range(0, taskCount):
    if i == 0:
        sampleSize = n + r
    else:
        sampleSize = n
    deviceName = "/job:local/task:" + str(i)
    with tf.device(deviceName):
        # Sample points uniformly in the square [-1, 1] x [-1, 1] on this node.
        pointList = tf.random_uniform(shape=[sampleSize, 2], minval=-1, maxval=1, dtype=tf.float32)
        # Count the points whose distance from the origin is below 1 (inside the unit circle).
        distanceList = tf.sqrt(tf.reduce_sum(tf.pow(pointList, 2), 1))
        boolList = tf.less(distanceList, 1)
        circleCount = tf.reduce_sum(tf.cast(boolList, tf.float32))
        total = total + circleCount
    print("task:", i, " sampleSize: ", sampleSize)
with tf.Session("grpc://localhost:1024") as sess:
    sess.run(tf.global_variables_initializer())
    pi = sess.run(4 * (total / size))
    print("pi:", pi)
end = time.time()
totalTime = end - start
print("Time: {:.3f}".format(totalTime))

B.4 Raspberry PI Cluster Code
lstm_architecture.py
__author__ = 'jk_ranbir'
import tensorflow as tf
from sklearn import metrics
from sklearn.utils import shuffle
import numpy as np
from datetime import datetime
import time
import sys
#import tensorflow as tf
import netifaces as ni

def getIpAddr():
ni.ifaddresses("eth0")
ip = ni.ifaddresses("eth0")[ni.AF_INET][0]["addr"]
return ip
tf.app.flags.DEFINE_string("job_name", "", "Either 'ps' or 'worker'")
FLAGS = tf.app.flags.FLAGS
parameter_servers = ["192.168.1.26:1024"]
workers =
["192.168.1.16:1024","192.168.1.17:1024","192.168.1.18:1024","192.168.1.19:1024",
"192.168.1.20:1024","192.168.1.21:1024","192.168.1.22:1024","192.168.1.23:1024",
"192.168.1.24:1024","192.168.1.25:1024","192.168.1.26:1024","192.168.1.27:1024",
"192.168.1.28:1024","192.168.1.29:1024","192.168.1.30:1024","192.168.1.31:1024"]
taskName = getIpAddr()+":1024"
try:
taskNum = workers.index(taskName)
except ValueError:
print(" Unable to find " + taskName + " in the worker group.")
quit()
cluster = tf.train.ClusterSpec({"ps":parameter_servers, "worker":workers})

server = tf.train.Server(cluster,job_name=FLAGS.job_name,task_index=taskNum)
if FLAGS.job_name == "ps":
server.join()
elif FLAGS.job_name == "worker":
def one_hot(y):
"""convert label from dense to one hot
argument:
label: ndarray dense label ,shape: [sample_num,1]
return:
one_hot_label: ndarray one hot, shape: [sample_num,n_class]
"""
# e.g.: [[5], [0], [3]] --> [[0, 0, 0, 0, 0, 1], [1, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0]]
y = y.reshape(len(y))
n_values = np.max(y) + 1
return np.eye(n_values)[np.array(y, dtype=np.int32)] # Returns FLOATS

def batch_norm(input_tensor, config, i):
# Implementing batch normalisation: this is used out of the residual layers
# to normalise those output neurons by mean and standard deviation.
if config.n_layers_in_highway == 0:
# There is no residual layers, no need for batch_norm:
return input_tensor
with tf.variable_scope("batch_norm") as scope:
if i != 0:

# Do not create extra variables for each time step
scope.reuse_variables()
# Mean and variance normalisation simply crunched over all axes
axes = list(range(len(input_tensor.get_shape())))
mean, variance = tf.nn.moments(input_tensor, axes=axes, shift=None,
name=None, keep_dims=False)
stdev = tf.sqrt(variance+0.001)
# Rescaling
bn = input_tensor - mean
bn /= stdev
# Learnable extra rescaling
# tf.get_variable("relu_fc_weights",
initializer=tf.random_normal(mean=0.0, stddev=0.0)
bn *= tf.get_variable("a_noreg", initializer=tf.random_normal([1],
mean=0.5, stddev=0.0))
bn += tf.get_variable("b_noreg", initializer=tf.random_normal([1],
mean=0.0, stddev=0.0))
# bn *= tf.Variable(0.5, name=(scope.name + "/a_noreg"))
# bn += tf.Variable(0.0, name=(scope.name + "/b_noreg"))
return bn
def relu_fc(input_2D_tensor_list, features_len, new_features_len, config):
"""make a relu fully-connected layer, mainly change the shape of tensor
both input and output is a list of tensor
argument:
input_2D_tensor_list: list shape is [batch_size,feature_num]
features_len: int the initial features length of input_2D_tensor
new_feature_len: int the final features length of output_2D_tensor

config: Config used for weights initializers
return:
output_2D_tensor_list lit shape is [batch_size,new_feature_len]
"""
W = tf.get_variable(
"relu_fc_weights",
initializer=tf.random_normal(
[features_len, new_features_len],
mean=0.0,
stddev=float(config.weights_stddev)
)
)
b = tf.get_variable(
"relu_fc_biases_noreg",
initializer=tf.random_normal(
[new_features_len],
mean=float(config.bias_mean),
stddev=float(config.weights_stddev)
)
)
# intra-timestep multiplication:
output_2D_tensor_list = [
tf.nn.relu(tf.matmul(input_2D_tensor, W) + b)
for input_2D_tensor in input_2D_tensor_list
]
return output_2D_tensor_list
def single_LSTM_cell(input_hidden_tensor, n_outputs):
""" define the basic LSTM layer

argument:
input_hidden_tensor: list a list of tensor,
shape: time_steps*[batch_size,n_inputs]
n_outputs: int num of LSTM layer output
return:
outputs: list a time_steps list of tensor,
shape: time_steps*[batch_size,n_outputs]
"""
with tf.variable_scope("lstm_cell"):
lstm_cell = tf.nn.rnn_cell.LSTMCell(n_outputs, state_is_tuple=True,
forget_bias=0.999)
outputs, _ = tf.nn.static_rnn(lstm_cell, input_hidden_tensor,
dtype=tf.float32)
return outputs
def bi_LSTM_cell(input_hidden_tensor, n_inputs, n_outputs, config):
"""build bi-LSTM, concatenating the two directions in an inner manner.
argument:
input_hidden_tensor: list a time_steps series of tensor, shape:
[sample_num, n_inputs]
n_inputs: int units of input tensor
n_outputs: int units of output tensor, each bi-LSTM will have half those
internal units
config: Config used for the relu_fc
return:
layer_hidden_outputs: list a time_steps series of tensor, shape:
[sample_num, n_outputs]
"""
n_outputs = int(n_outputs/2)

print ("bidir:")
with tf.variable_scope('pass_forward') as scope2:
hidden_forward = relu_fc(input_hidden_tensor, n_inputs, n_outputs,
config)
forward = single_LSTM_cell(hidden_forward, n_outputs)
print (len(hidden_forward), str(hidden_forward[0].get_shape()))
# Backward pass is as simple as surrounding the cell with a double inversion:
with tf.variable_scope('pass_backward') as scope2:
hidden_backward = relu_fc(input_hidden_tensor, n_inputs, n_outputs,
config)
backward =
list(reversed(single_LSTM_cell(list(reversed(hidden_backward)), n_outputs)))
with tf.variable_scope('bidir_concat') as scope:
# Simply concatenating cells' outputs at each timesteps on the innermost
# dimension, like if the two cells acted as one cell
# with twice the n_hidden size:
layer_hidden_outputs = [
tf.concat([f, b], len(f.get_shape()) - 1)
for f, b in zip(forward, backward)]
return layer_hidden_outputs
def residual_bidirectional_LSTM_layers(input_hidden_tensor, n_input, n_output,
layer_level, config, keep_prob_for_dropout):
"""This architecture is only enabled if "config.n_layers_in_highway" has a
value only greater than int(0). The arguments are same than for bi_LSTM_cell.
arguments:

input_hidden_tensor: list a time_steps series of tensor, shape:
[sample_num, n_inputs]
n_inputs: int units of input tensor
n_outputs: int units of output tensor, each bi-LSTM will have half those
internal units
config: Config used for determining if there are residual connections and
if yes, their number and with some batch_norm.
return:
layer_hidden_outputs: list a time_steps series of tensor, shape:
[sample_num, n_outputs]
"""
with tf.variable_scope('layer_{}'.format(layer_level)) as scope:
if config.use_bidirectionnal_cells:
get_lstm = lambda input_tensor: bi_LSTM_cell(input_tensor, n_input,
n_output, config)
else:
get_lstm = lambda input_tensor:
single_LSTM_cell(relu_fc(input_tensor, n_input, n_output, config), n_output)
def add_highway_redisual(layer, residual_minilayer):
return [a + b for a, b in zip(layer, residual_minilayer)]
hidden_LSTM_layer = get_lstm(input_hidden_tensor)
# Adding K new (residual bidir) connections to this first layer:
for i in range(config.n_layers_in_highway - 1):
with tf.variable_scope('LSTM_residual_{}'.format(i)) as scope2:
hidden_LSTM_layer = add_highway_redisual(
hidden_LSTM_layer,

get_lstm(input_hidden_tensor) )
if config.also_add_dropout_between_stacked_cells:
hidden_LSTM_layer = [tf.nn.dropout(out, keep_prob_for_dropout) for
out in hidden_LSTM_layer]
return [batch_norm(out, config, i) for i, out in
enumerate(hidden_LSTM_layer)]
def LSTM_network(feature_mat, config, keep_prob_for_dropout):
"""model a LSTM Network,
it stacks 2 LSTM layers, each layer has n_hidden=32 cells
and 1 output layer, it is a full connet layer
argument:
feature_mat: ndarray fature matrix,
shape=[batch_size,time_steps,n_inputs]
config: class containing config of network
return:
: ndarray output shape [batch_size, n_classes]
"""
with tf.variable_scope('LSTM_network') as scope: # TensorFlow graph
naming
feature_mat = tf.nn.dropout(feature_mat, keep_prob_for_dropout)
# Exchange dim 1 and dim 0
feature_mat = tf.transpose(feature_mat, [1, 0, 2])
print (feature_mat.get_shape())
# New feature_mat's shape: [time_steps, batch_size, n_inputs]
# Temporarily crush the feature_mat's dimensions
feature_mat = tf.reshape(feature_mat, [-1, config.n_inputs])

print (feature_mat.get_shape())
# New feature_mat's shape: [time_steps*batch_size, n_inputs]
# Split the series because the rnn cell needs time_steps features, each of
shape:
hidden = tf.split(feature_mat, config.n_steps, 0)
print (len(hidden), str(hidden[0].get_shape()))
# New shape: a list of lenght "time_step" containing tensors of shape
[batch_size, n_hidden]
# Stacking LSTM cells, at least one is stacked:
print ("\nCreating hidden #1:")
hidden = residual_bidirectional_LSTM_layers(hidden, config.n_inputs,
config.n_hidden, 1, config, keep_prob_for_dropout)
print (len(hidden), str(hidden[0].get_shape()))
for stacked_hidden_index in range(config.n_stacked_layers - 1):
# If the config permits it, we stack more lstm cells:
print ("\nCreating hidden #{}:".format(stacked_hidden_index+2))
hidden = residual_bidirectional_LSTM_layers(hidden, config.n_hidden,
config.n_hidden, stacked_hidden_index+2, config, keep_prob_for_dropout)
print (len(hidden), str(hidden[0].get_shape()))
print ("")
# Final fully-connected activation logits
# Get the last output tensor of the inner loop output series, of shape
[batch_size, n_classes]
last_hidden = tf.nn.dropout(hidden[-1], keep_prob_for_dropout)
last_logits = relu_fc(

[last_hidden],
config.n_hidden, config.n_classes, config
)[0]
return last_logits
def run_with_config(Config, X_train, y_train, X_test, y_test):
start_time = datetime.now()
print ("Start Time: ",time.ctime())
print ("")
tf.reset_default_graph() # To enable to run multiple things in a loop
# ----------------------------------
# Define parameters for model
# ----------------------------------
config = Config(X_train, X_test)
print("Some useful info to get an insight on dataset's shape and normalisation:")
print("features shape, labels shape, each features mean, each features standard
deviation")
print(X_test.shape, y_test.shape,
np.mean(X_test), np.std(X_test))
print("the dataset is therefore properly normalised, as expected.")
# -----------------------------------------------------
# Let's get serious and build the neural network
# -----------------------------------------------------
# with tf.device("/cpu:0"):  # Remove this line to use GPU. If you have a too small GPU, it crashes.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
#with tf.device('/gpu:0'):

#with tf.device('/gpu:1'):
#mirrored_strategy = tf.contrib.distribute.MirroredStrategy(devices=["/gpu:0",
"/gpu:1"])
#with mirrored_strategy.scope():
X = tf.placeholder(tf.float32, [
None, config.n_steps, config.n_inputs], name="X")
Y = tf.placeholder(tf.float32, [
None, config.n_classes], name="Y")
# is_train for dropout control:
is_train = tf.placeholder(tf.bool, name="is_train")
keep_prob_for_dropout = tf.cond(is_train,
lambda: tf.constant(
config.keep_prob_for_dropout,
name="keep_prob_for_dropout"
),
lambda: tf.constant(
1.0,
name="keep_prob_for_dropout"
)
)
pred_y = LSTM_network(X, config, keep_prob_for_dropout)
# Loss, optimizer, evaluation
# Softmax loss with L2 and L1 layer-wise regularisation
print ("Unregularised variables:")
for unreg in [tf_var.name for tf_var in tf.trainable_variables() if ("noreg"
in tf_var.name or "Bias" in tf_var.name)]:
print (unreg)

l2 = config.lambda_loss_amount * sum(
tf.nn.l2_loss(tf_var)
for tf_var in tf.trainable_variables()
if not ("noreg" in tf_var.name or "Bias" in tf_var.name)
)
# first_weights = [w for w in tf.all_variables() if w.name ==
'LSTM_network/layer_1/pass_forward/relu_fc_weights:0'][0]
# l1 = config.lambda_loss_amount * tf.reduce_mean(tf.abs(first_weights))
loss = tf.reduce_mean(
tf.nn.softmax_cross_entropy_with_logits_v2(logits=pred_y,labels=Y))
+ l2 # + l1
# Gradient clipping Adam optimizer with gradient noise
optimize = tf.contrib.layers.optimize_loss(
loss,
global_step=tf.Variable(0),
learning_rate=config.learning_rate,
optimizer=tf.train.AdamOptimizer(learning_rate=config.learning_rate),
clip_gradients=config.clip_gradients,
gradient_noise_scale=config.gradient_noise_scale
)
correct_pred = tf.equal(tf.argmax(pred_y, 1), tf.argmax(Y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, dtype=tf.float32))
# -------------------------------------------
# Hooray, now train the neural network
# -------------------------------------------
# Note that log_device_placement can be turned off for less console spam.

#sessconfig = tf.ConfigProto(log_device_placement=False)
sessconfig = tf.ConfigProto(allow_soft_placement =
True,log_device_placement=False)
#sessconfig.gpu_options.allow_growth = True
#with tf.Session(config=sessconfig) as sess:
with tf.Session("grpc://localhost:1024") as sess:
#init = tf.global_variables_initializer()
sess.run(tf.global_variables_initializer())
best_accuracy = (0.0, "iter: -1")
best_f1_score = (0.0, "iter: -1")
# Start training for each batch and loop epochs
worst_batches = []
for i in range(config.training_epochs):
# Loop batches for an epoch:
shuffled_X, shuffled_y = shuffle(X_train, y_train, random_state=i*42)
for start, end in zip(range(0, config.train_count, config.batch_size),
range(config.batch_size, config.train_count + 1,
config.batch_size)):
_, train_acc, train_loss, train_pred = sess.run(
[optimize, accuracy, loss, pred_y],
feed_dict={
X: shuffled_X[start:end],
Y: shuffled_y[start:end],
is_train: True
}
)
worst_batches.append(

(train_loss, shuffled_X[start:end], shuffled_y[start:end])
)
worst_batches = list(sorted(worst_batches))[-5:] # Keep 5 poorest
# Train F1 score is not on boosting
train_f1_score = metrics.f1_score(
shuffled_y[start:end].argmax(1), train_pred.argmax(1),
average="weighted"
)
# Retrain on top worst batches of this epoch (boosting):
# a.k.a. "focus on the hardest exercises while training":
for _, x_, y_ in worst_batches:
_, train_acc, train_loss, train_pred = sess.run(
[optimize, accuracy, loss, pred_y],
feed_dict={
X: x_,
Y: y_,
is_train: True
}
)
# Test completely at the end of every epoch:
# Calculate accuracy and F1 score
pred_out, accuracy_out, loss_out = sess.run(
[pred_y, accuracy, loss],
feed_dict={
X: X_test,
Y: y_test,

is_train: False
}
)
# "y_test.argmax(1)": could be optimised by being computed once...
f1_score_out = metrics.f1_score(
y_test.argmax(1), pred_out.argmax(1), average="weighted"
)
print (
"iter: {}, ".format(i) + \
"train loss: {}, ".format(train_loss) + \
"train accuracy: {}, ".format(train_acc) + \
"train F1-score: {}, ".format(train_f1_score) + \
"test loss: {}, ".format(loss_out) + \
"test accuracy: {}, ".format(accuracy_out) + \
"test F1-score: {}".format(f1_score_out)
)
best_accuracy = max(best_accuracy, (accuracy_out, "iter:
{}".format(i)))
best_f1_score = max(best_f1_score, (f1_score_out, "iter: {}".format(i)))
print("")
print("final test accuracy: {}".format(accuracy_out))
print("best epoch's test accuracy: {}".format(best_accuracy))
print("final F1 score: {}".format(f1_score_out))
print("best epoch's F1 score: {}".format(best_f1_score))
print("")
end_time = datetime.now()
print("End Time: ",time.ctime())
print("Exec Duration: {}".format(end_time - start_time))

print("")
# returning both final and bests accuracies and f1 scores.
return accuracy_out, best_accuracy, f1_score_out, best_f1_score

Config_Dataset_HAR.py
#!/usr/bin/env python
# coding: utf-8
# In[ ]:
__author__ = 'jkranbir'
# Note: Linux bash commands start with a "!" inside those "ipython notebook" cells
import os
DATA_PATH = "data/"
get_ipython().system('pwd && ls')
os.chdir(DATA_PATH)
get_ipython().system('pwd && ls')
get_ipython().system('python download_datasets.py')
get_ipython().system('pwd && ls')
os.chdir("..")
get_ipython().system('pwd && ls')
DATASET_PATH = DATA_PATH + "UCI HAR Dataset/"
print("\n" + "Dataset is now located at: " + DATASET_PATH)

# In[ ]:
__author__ = 'jkranbir'
from lstm_architecture import one_hot, run_with_config
import numpy as np
import os
#os.environ["CUDA_VISIBLE_DEVICES"]="0,1"
# -------------------------------------------
# Neural net's config.
# -------------------------------------------
class Config(object):
"""
define a class to store parameters,
the input should be feature mat of training and testing
"""
def __init__(self, X_train, X_test):
workers =
["192.168.1.16:1024","192.168.1.17:1024","192.168.1.18:1024","192.168.1.19:1024",
"192.168.1.20:1024","192.168.1.21:1024","192.168.1.22:1024","192.168.1.23:1024",
"192.168.1.24:1024","192.168.1.25:1024","192.168.1.26:1024","192.168.1.27:1024",

"192.168.1.28:1024","192.168.1.29:1024","192.168.1.30:1024","192.168.1.31:1024"]
taskCount = len(workers)
cluster = tf.train.ClusterSpec({"worker":workers})
# Data shaping
self.train_count = len(X_train) // taskCount # 7352/16 training series
self.test_data_count = len(X_test) // taskCount # 2947/16 testing series
self.n_steps = len(X_train[0]) # 128 time_steps per series
self.n_classes = 6 # Final output classes
# Training
self.learning_rate = 0.001
self.lambda_loss_amount = 0.005
self.training_epochs = 250 #5
self.batch_size = 100
self.clip_gradients = 15.0
self.gradient_noise_scale = None
# Dropout is added on inputs and after each stacked layers (but not
# between residual layers).
self.keep_prob_for_dropout = 0.85 # **(1/3.0)
# Linear+relu structure
self.bias_mean = 0.3
# I would recommend between 0.1 and 1.0 or to change and use a xavier
# initializer
self.weights_stddev = 0.2
########

# NOTE: I think that if any of the below parameters are changed,
# the best is to readjust every parameters in the "Training" section
# above to properly compare the architectures only once optimised.
########
# LSTM structure
# Features count is of 9: three 3D sensors features over time
self.n_inputs = len(X_train[0][0])
self.n_hidden = 256 # nb of neurons inside the neural network
# Use bidir in every LSTM cell, or not:
self.use_bidirectionnal_cells = True #False
# High-level deep architecture
self.also_add_dropout_between_stacked_cells = False #True
# NOTE: values of exactly 1 (int) for those 2 high-level parameters below totally
disables them and result in only 1 starting LSTM.
# self.n_layers_in_highway = 1 # Number of residual connections to the LSTMs
(highway-style), this is did for each stacked block (inside them).
# self.n_stacked_layers = 1 # Stack multiple blocks of residual
# layers.

# -------------------------------------------
# Dataset-specific constants and functions + loading
# -------------------------------------------
# Useful Constants

# Those are separate normalised input features for the neural network
INPUT_SIGNAL_TYPES = [
"body_acc_x_",
"body_acc_y_",
"body_acc_z_",
"body_gyro_x_",
"body_gyro_y_",
"body_gyro_z_",
"total_acc_x_",
"total_acc_y_",
"total_acc_z_"
]
# Output classes to learn how to classify
LABELS = [
"WALKING",
"WALKING_UPSTAIRS",
"WALKING_DOWNSTAIRS",
"SITTING",
"STANDING",
"LAYING"
]
DATA_PATH = "data/"
DATASET_PATH = DATA_PATH + "UCI HAR Dataset/"
TRAIN = "train/"
TEST = "test/"

# Load "X" (the neural network's training and testing inputs)
def load_X(X_signals_paths):
"""
Given attribute (train or test) of feature, read all 9 features into an
np ndarray of shape [sample_sequence_idx, time_step, feature_num]
argument: X_signals_paths str attribute of feature: 'train' or 'test'
return:

np ndarray, tensor of features

"""
X_signals = []
for signal_type_path in X_signals_paths:
file = open(signal_type_path, 'r')
# Read dataset from disk, dealing with text files' syntax
X_signals.append(
[np.array(serie, dtype=np.float32) for serie in [
row.replace('  ', ' ').strip().split(' ') for row in file
]]
)
file.close()
return np.transpose(np.array(X_signals), (1, 2, 0))
X_train_signals_paths = [
DATASET_PATH + TRAIN + "Inertial Signals/" + signal + "train.txt" for signal in
INPUT_SIGNAL_TYPES
]
X_test_signals_paths = [
DATASET_PATH + TEST + "Inertial Signals/" + signal + "test.txt" for signal in
INPUT_SIGNAL_TYPES
]

X_train = load_X(X_train_signals_paths)
X_test = load_X(X_test_signals_paths)
# Load "y" (the neural network's training and testing outputs)
def load_y(y_path):
"""
Read Y file of values to be predicted
argument: y_path str attibute of Y: 'train' or 'test'
return: Y ndarray / tensor of the 6 one_hot labels of each sample
"""
file = open(y_path, 'r')
# Read dataset from disk, dealing with text file's syntax
y_ = np.array(
[elem for elem in [
row.replace('  ', ' ').strip().split(' ') for row in file
]],
dtype=np.int32
)
file.close()
# Substract 1 to each output class for friendly 0-based indexing
return one_hot(y_ - 1)
y_train_path = DATASET_PATH + TRAIN + "y_train.txt"
y_test_path = DATASET_PATH + TEST + "y_test.txt"
y_train = load_y(y_train_path)
y_test = load_y(y_test_path)
# -------------------------------------------
# Training (maybe multiple) experiment(s)
# -------------------------------------------
n_layers_in_highway = 4
n_stacked_layers = 4
trial_name = "{}x{}".format(n_layers_in_highway, n_stacked_layers)
for i in range(0, taskCount):
for learning_rate in [0.001]: # [0.01, 0.001, 0.0001]:
for lambda_loss_amount in [0.005]:
for clip_gradients in [15.0]:
print ("learning_rate: {}".format(learning_rate))
print ("lambda_loss_amount: {}".format(lambda_loss_amount))
print ("")
class EditedConfig(Config):
def __init__(self, X, Y):
super(EditedConfig, self).__init__(X, Y)
# Edit only some parameters:
self.learning_rate = learning_rate
self.lambda_loss_amount = lambda_loss_amount
self.clip_gradients = clip_gradients
# Architecture params:
self.n_layers_in_highway = n_layers_in_highway
self.n_stacked_layers = n_stacked_layers
# # Useful catch upon looping (e.g.: not enough memory)
# try:
#     accuracy_out, best_accuracy = run_with_config(EditedConfig)
# except:
#     accuracy_out, best_accuracy = -1, -1

accuracy_out, best_accuracy, f1_score_out, best_f1_score = (
run_with_config(EditedConfig, X_train, y_train, X_test, y_test)
)
print (accuracy_out, best_accuracy, f1_score_out, best_f1_score)
with open('{}_result_HAR_6.txt'.format(trial_name), 'a') as f:
f.write(str(learning_rate) + ' \t' + str(lambda_loss_amount) + ' \t' +
str(clip_gradients) + ' \t' + str(
accuracy_out) + ' \t' + str(best_accuracy) + ' \t' + str(f1_score_out) + '
\t' + str(best_f1_score) + '\n\n')
print ("________________________________________________________")
print ("")
print ("Done.")
# In[ ]:
data/download_datasets.py: same as in the TensorFlow code (Appendix B.1).


APPENDIX C – TENSORFLOW SETUP
C.1 TensorFlow Installation
In APPENDIX A we installed the NVIDIA GPUs in the workstation, along with the
NVIDIA driver, CUDA and cuDNN. This section assumes all of those installations as
prerequisites.
Step-1: Check whether conda is already installed on your machine with the command below.
hpcmonster369@hpc369-Z10PE-D16-WS:~$ conda --version
If it is not installed, download Anaconda from the website below.
https://www.anaconda.com/download/#linux
Step-2: Go to the download folder and verify the md5sum value of the downloaded
Anaconda installer against the official hash published at the link given below.
md5sum Anaconda3-5.3.0-Linux-x86_64.sh
4321e9389b648b5a02824d4473cfdb5f Anaconda3-5.3.0-Linux-x86_64.sh
The expected md5sum is listed at:
http://docs.anaconda.com/anaconda/install/hashes/Anaconda3-5.3.0-Linux-x86_64.sh-hash/
If both md5sum values match, the installer was downloaded correctly.
Step-3: Install Anaconda3.
bash Anaconda3-5.3.0-Linux-x86_64.sh

Step-4: Go through the installer prompts and accept the Anaconda3 license.
Step-5: It is recommended to answer yes when asked to prepend the Anaconda3 install
location to the PATH in your .bashrc file. For safety, you can back up your .bashrc file
before answering yes.
Step-6: Activate the installation with the command below.
source ~/.bashrc
Step-7: Verify the installation → conda list

C.2 TensorFlow Environment Setup
Step-1: Create the environment
hpcmonster369@hpc369-Z10PE-D16-WS:~$ conda create --name
rnn_lstm_har_tensorflow tensorflow-gpu
Step-2: Activate the environment
hpcmonster369@hpc369-Z10PE-D16-WS:~$ conda activate rnn_lstm_har_tensorflow
Step-3: Add Dependencies
conda install numpy
conda install keras
conda install pandas
conda install matplotlib
conda install scipy scikit-learn
conda install nb_conda
Step-4: Check Available Jupyter Kernel

hpcmonster369@hpc369-Z10PE-D16-WS:~$ jupyter kernelspec list
Step-5: Validate the environment
conda info --envs
Once all steps have completed successfully, the environment is listed as shown in Figure 73.

Figure. 73 TensorFlow Project Screen


APPENDIX D – PYTORCH SETUP
D.1 PyTorch Installation
In APPENDIX A we installed the NVIDIA GPUs in the workstation, along with the
NVIDIA driver, CUDA and cuDNN, and in APPENDIX C we installed Anaconda3. This
section assumes all of those installations as prerequisites.

D.2 PyTorch Environment Setup
Step-1: Create the environment
hpcmonster369@hpc369-Z10PE-D16-WS:~$ conda create -n rnn_lstm_har_pytorch
python=3.6
Step-2: Activate the environment
hpcmonster369@hpc369-Z10PE-D16-WS:~$ conda activate rnn_lstm_har_pytorch
Step-3: Add Dependencies
conda install pytorch=0.4.1 cuda90 -c pytorch
conda install torchvision -c pytorch
conda install matplotlib
conda install -c conda-forge tqdm
conda install nb_conda
Step-4: Validate the environment
conda info --envs
Once all steps have completed successfully, the environment is listed as shown in Figure 74.

Figure. 74 PyTorch Project Screen


APPENDIX E – RASPBERRY PI CLUSTER SETUP
E.1 Raspberry Pi Parts
1. Raspberry Pi 3 Model B+ motherboard
2. Samsung 32 GB Class 10 MicroSD card
3. 2.5A Power Adapter
4. 2 Heat sinks
5. MicroSD USB Reader (Optional)
6. Premium Case (Optional)
7. Premium HDMI Cable (Optional)

E.2 Individual Raspberry Pi Installation
Step-1: Install "Raspbian Stretch with desktop" (kernel version 4.14) from the official
Raspberry Pi website link given below.
https://www.raspberrypi.org/downloads/raspbian/
Step-2: Download and install Etcher (Linux x86 version), which burns the Raspbian
image to the microSD card. The link is given below.
https://www.balena.io/etcher/
Step-3: Using a microSD card adapter, flash each microSD card with Etcher so that every
card carries the kernel 4.14 image.
Step-4: To enable SSH remote access to the Pi, open the "boot" drive on the microSD
card and create an empty file named ssh with no extension. Open the folder in a shell and
run the command below.
$ touch ssh

Step-5: To build a package that supports all Raspberry Pi devices, including the Pi 1 and
Zero, run the command below on your Linux machine; it builds a .whl package for
installation on the Raspberry Pi.
$ tensorflow/tools/ci_build/ci_build.sh PI \
tensorflow/tools/ci_build/pi/build_raspberry_pi.sh PI_ONE
For updated version, please use the below link
https://www.tensorflow.org/install/source_rpi
Step-6: Copy the wheel file to the Raspberry Pi and install with pip with appropriate
version number.
$ pip install tensorflow-version-cp34-none-linux_armv7l.whl
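A quick illustrative check that the wheel installed correctly is to import TensorFlow on the Pi and print its version:
pi@raspberrypi:~$ python3 -c "import tensorflow as tf; print(tf.__version__)"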
Step-7: Connect all of the Raspberry Pis to a switch that is connected to the network.
Step-8: Log in to each Pi using the default password, raspberry.
Step-9: Run raspi-config to complete the remaining setup.
pi@raspberrypi~$ sudo raspi-config
1. Change the default password to one of your own choosing.
2. Set the locale and timezone.
3. Rename each Pi from the default name to rpi# according to the nodes that will be used in the cluster, and set the hostname on each Pi. You can do this from the configuration files shown below, then restart the Pi.
sudo hostname node01       # or whatever name you chose
sudo nano /etc/hostname    # change the hostname here too
sudo nano /etc/hosts       # change "raspberrypi" to "node01"

4. In raspi-config, check whether SSH is enabled; if it is not, enable it.
5. Change the memory assigned to the GPU to the minimum.
6. Change the memory assigned to the CPU to the maximum.
7. In raspi-config, change (3. Boot Options > B2 Wait for Network at Boot) from “No” to “Yes”. This ensures that networking is available before the fstab file mounts the NFS share.
8. Restart the Pi with the command below.
sudo reboot
9. Repeat this process for all of the Pis.
10. For passwordless login to each Pi, generate SSH keys on every node and distribute the public key of each node to the rest of the nodes; see the link below for generating SSH keys, and the example commands that follow. Afterwards, update /etc/hosts on each node with the IP addresses of the remaining nodes.
https://www.raspberrypi.org/documentation/remote-access/ssh/passwordless.md
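For example, the key generation and distribution from one node to another looks roughly like the following (node02 is a placeholder hostname; repeat the ssh-copy-id step for every other node in the cluster):
pi@node01:~$ ssh-keygen -t rsa
pi@node01:~$ ssh-copy-id pi@node02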
Step-10: To operate as a cluster, set up the NFS server on the master node and the NFS client on all of the worker nodes, as already described in Section 5.2, Cluster of Raspberry Pis Setup; an illustrative client-side mount entry is shown below.
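As an illustration of the client side only (the master hostname and export path below are placeholders; the actual values follow Section 5.2), an /etc/fstab entry on a worker node might look like:
node01:/home/pi/nfsshare /home/pi/nfsshare nfs defaults 0 0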

E.3 Installation of Program
Step-1: Run the server.py program on each Raspberry Pi node.
Step-2: The master node should contain both lstm_architecture.py and Config_Dataset_HAR.py, together with the dataset folder, inside the NFS server directory.
Step-3: The folder structure of the dataset is described in APPENDIX-A.
Step-4: Run Config_Dataset_HAR.py on the master node to start the deep model training iterations, as illustrated below.
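For clarity, the launch sequence looks like the following (the hostnames are examples, and server.py may take additional arguments depending on the cluster configuration):
pi@rpi02:~$ python3 server.py                  # on each Raspberry Pi node
pi@rpi01:~$ python3 Config_Dataset_HAR.py      # on the master node, from the NFS server directory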
