On Resilience ...

 

By Ahmed Tariq Sheikh, RC3 Postdoctoral Fellow

We are living in challenging times in which resiliency is a requirement, no longer an option. Resiliency has many facets, including, but not limited to, human resilience to the environment and to disease, and the resilience of crops to drought and natural disasters. However, the modern era cannot function without the continuous operation of Information and Communication Technology (ICT) equipment. Our daily activities depend heavily on the correctness of these systems, and the failure of some of them may lead to catastrophe. Digital resiliency is the need of the hour and can no longer be overlooked without consequences.

My research spans several areas of computing, including computer architecture, parallel computer architecture, and digital design and synthesis. My primary research focus is the realization of resilient digital systems. This is usually achieved through one of two approaches:

  • The system-level approach, and
  • Architectural hybridization.

At the system level, components are replicated so that faults can be masked or corrected. The number of replicas grows linearly with the number of faults the system is supposed to tolerate. As a rule of thumb, if a system has to tolerate f faults, there must be 3f+1 replicas of that system. My previous research focused on designing fault-tolerant systems based on the concept of selective replication. In this scheme, replication is applied at a lower level of abstraction, i.e., at the gate level or the transistor level, and only the sensitive modules with a high probability of failure are replicated. At the highest level, the system appears as a single entity, but it is made resilient by redundancy at the gate level or transistor level. The benefit of this approach is that any system can be dissected into its basic components and replicated selectively. The drawback is the significant design and analysis time.
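The 3f+1 rule of thumb above can be made concrete with a short sketch. The function names `replicas_needed` and `vote` are hypothetical and chosen only for illustration. The voter is a simplified model that assumes every replica reports one output value and that non-faulty replicas agree. It is not the selective replication scheme described above, nor a complete fault-tolerance protocol.

```python
from collections import Counter


def replicas_needed(f: int) -> int:
    """Replica count under the 3f+1 rule of thumb for tolerating f faults."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1


def vote(outputs, f):
    """Pick the output agreed on by at least n - f of the n replicas.

    Assumption: at most f replicas are faulty, so the non-faulty ones
    contribute at least n - f identical outputs. Returns None when no
    value reaches that threshold.
    """
    n = len(outputs)
    if n < replicas_needed(f):
        raise ValueError("not enough replicas for the requested fault count")
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= n - f else None
```

For f = 1 the rule asks for four replicas. A call such as `vote([5, 5, 5, 9], 1)` accepts the value reported by three of them, while `vote([5, 5, 9, 9], 1)` returns `None` because neither value reaches the threshold of three.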

Architectural hybridization is another approach that has gained a lot of momentum. Its central concept is that the underlying hardware is made secure or trustworthy by adding trusted hardware components. These trusted components work together by consensus to tolerate faults, based on the principle of Byzantine Fault Tolerance (BFT). My current research will first investigate faults in Multiprocessor System-on-Chip (MPSoC) platforms and then implement systems that employ BFT. The prototypes will mostly be developed and tested on Field-Programmable Gate Array (FPGA) boards. This research is in line with the RC3 vision and mission of achieving resiliency and fault tolerance in the security plane, and it will also serve as a stepping stone towards the digital sovereignty of the Kingdom.