Transformer-based neural networks


One technology that largely survived the AI winters was the neural network. While there were reductions in interest and funding, its development seemed largely unaffected by the fortunes of AI in general. Although changes in approaches to data handling and knowledge representation allowed for the current successes in AI, it was the development of neural networks that created the computing systems on which these representations could be deployed.

Alongside the rule-based approaches of the 1940s were developments around the idea of the ‘perceptron’ – a mathematical (and programmable) version of an artificial neuron. This could take a combination of data values as input, apply a computation, and produce one or more values as output. This captured the cybernetic view that a computer circuit of wires and hardware could mimic the behaviour of flesh-and-blood organisms.

A block diagram for a single perceptron is shown below.

A diagram of a perceptron. On the left side, in a vertical row, are four words (from top to bottom): input 1, input 2, input 3 and input 4. A horizontal line connects the word ‘input 1’ to the word ‘parameter 1’ on the right; a horizontal line connects the word ‘input 2’ to the word ‘parameter 2’ on the right; a horizontal line connects the word ‘input 3’ to the word ‘parameter 3’ on the right; and a horizontal line connects the word ‘input 4’ to the word ‘parameter 4’ on the right. A dotted vertical line connects the lines between input/parameter 3 and input/parameter 4. All four ‘parameter’ words have an arrow pointing right, and all four arrows lead to the same large circle, which has the words ‘combining function’ in the middle. An arrow from the right side of the circle points right, to the word ‘output’.

Drawn by Kevin Waugh, OU course author, using Word draw.
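
As a minimal sketch of the computation in the diagram – assuming the ‘combining function’ is the classic choice of a weighted sum followed by a threshold – a perceptron can be written in a few lines of Python:

```python
# A minimal perceptron: four inputs, four parameters (weights),
# and a combining function that thresholds their weighted sum.

def perceptron(inputs, parameters, bias=0.0):
    # Combining function: weighted sum of inputs and parameters ...
    combined = sum(i * p for i, p in zip(inputs, parameters)) + bias
    # ... followed by a threshold to produce a single output.
    return 1 if combined > 0 else 0

# Example: four input values and four hand-chosen parameters.
output = perceptron([1.0, 0.0, 1.0, 1.0], [0.5, -0.2, 0.3, -0.7])
print(output)  # 1 if the weighted sum is positive, otherwise 0
```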

Single perceptrons performed very simple tasks, for example detecting whether a sheet of paper had a dark left side or a dark right side. It was only when they were combined in larger numbers that they were able to do ‘useful’ tasks. An early breakthrough was the ability to differentiate between circles and squares drawn on paper.

By combining neurons into interconnected networks and adjusting the computation applied by each individual neuron, the resulting system can learn to perform classification tasks. Given an image, for example, it could decide whether the image represented a particular shape. Hand-tuning the parameters of a perceptron would be too time-consuming, so techniques were developed by which the perceptron could tune itself – a rudimentary form of learning.
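
One such self-tuning technique is the classic perceptron learning rule: after each example, nudge every parameter in proportion to the error it made. The sketch below assumes that rule; the toy task and data are invented for illustration:

```python
# Perceptron learning rule: after each example, nudge each parameter
# towards values that would have produced the correct output.

def train_perceptron(examples, n_inputs, rate=0.1, epochs=20):
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            combined = sum(i * w for i, w in zip(inputs, weights)) + bias
            output = 1 if combined > 0 else 0
            error = target - output  # -1, 0 or +1
            weights = [w + rate * error * i for w, i in zip(weights, inputs)]
            bias += rate * error
    return weights, bias

# Toy task: output 1 when the first input is 'dark' (1), otherwise 0.
examples = [([1, 0], 1), ([1, 1], 1), ([0, 1], 0), ([0, 0], 0)]
weights, bias = train_perceptron(examples, n_inputs=2)
print(weights, bias)  # parameters tuned by the rule, not by hand
```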

Below is a block diagram of a single-layer neural network – it is the single hidden layer that makes this a ‘single-layer’ network.

On the left-hand side are four circles in a vertical column, each with a horizontal arrow pointing into its left side. Underneath the column is the phrase ‘input layer’. In the middle is a vertical column of five circles, with the phrase ‘hidden layer’ underneath. Lines connect each of the circles on the left to each of the circles in the middle. On the right-hand side is a vertical column of three circles, with the phrase ‘output layer’ underneath. Lines connect each of the middle circles to each of the right-hand circles.

Drawn by Kevin Waugh, OU course author, using Word draw.
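
To make the diagram concrete, here is a sketch of a forward pass through such a network, assuming the layer sizes shown (four inputs, five hidden neurons, three outputs) and a sigmoid as each neuron’s computation; the weights are random placeholders rather than trained values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes taken from the diagram: 4 inputs, 5 hidden, 3 outputs.
W_hidden = rng.normal(size=(4, 5))   # input -> hidden connections
W_output = rng.normal(size=(5, 3))   # hidden -> output connections

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(inputs):
    hidden = sigmoid(inputs @ W_hidden)   # hidden layer activations
    return sigmoid(hidden @ W_output)     # output layer activations

print(forward(np.array([0.2, 0.9, 0.1, 0.5])))  # three output values
```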

At the start, progress on perceptrons was slow; early ‘single-layer’ neural nets were quite disappointing in what they could achieve. Developments in multi-layer networks, however, showed great promise, and a technique known as backpropagation (which feeds errors from the output back through the earlier layers of the network) gave them the ability to undergo training: positive and negative feedback adjusted the internal parameters of individual neurons within the network, improving its performance on a task.
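
A compact sketch of that training loop is given below, applied to the XOR problem that single-layer networks famously cannot solve; the layer sizes, learning rate and epoch count are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# XOR: output 1 when exactly one input is 1 - a task beyond any
# single-layer network, but learnable with a hidden layer.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))  # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))  # hidden -> output

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr = 1.0  # learning rate
for _ in range(10000):
    # Forward pass through both layers.
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)

    # Backpropagation: feed the output error backwards through the
    # layers to get an adjustment for every internal parameter.
    d_output = (output - y) * output * (1 - output)
    d_hidden = (d_output @ W2.T) * hidden * (1 - hidden)

    W2 -= lr * hidden.T @ d_output
    b2 -= lr * d_output.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_hidden
    b1 -= lr * d_hidden.sum(axis=0, keepdims=True)

print(np.round(output, 2))  # should approach 0, 1, 1, 0
```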

Over the years, significant milestones were met. In 1957, a perceptron-based system was taught to recognise basic shapes in images, using learning techniques that were forerunners of what we now call deep learning. In 1992, a neural network (Gerald Tesauro’s TD-Gammon) was trained to play backgammon – it had approximately 25,000 parameters and was trained on over 1.5 million games in which the system played itself.

One weakness of neural networks is that they are computationally expensive and require large amounts of training data. Developments in parallel processing, particularly the creation of Graphics Processing Units (GPUs), solved some of the hardware issues, while the growth of content on the Internet provided massive amounts of data that could be used for training purposes.

The final step on the path to Generative AI technologies required one more neural network development – the transformer.

Prior to the development of the transformer, neural networks tended to treat each input element (say, the words in a sentence) in a linear manner – for example, processing the words one at a time, from left to right. Transformer-based neural networks instead identify the distinct parts of the whole input, such as the words of a sentence, and relate each part to the parts around it. It is as if the network could see the whole sentence at once while giving attention to specific parts of it. It can therefore pick up relationships between words within the sentence, and with other sentences in the input.
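
A sketch of the core mechanism – scaled dot-product self-attention – is shown below in numpy. Every word’s vector is compared with every other word’s vector in a single step, and each output blends together the words it attends to most; the vectors and projection matrices here are random stand-ins for the learned values in a real system:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in embeddings: one 8-dimensional vector per word of a
# 5-word sentence (real systems learn these vectors).
words = rng.normal(size=(5, 8))

# Learned projections give each word a query, key and value;
# random matrices stand in for the learned ones here.
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
Q, K, V = words @ Wq, words @ Wk, words @ Wv

# Every word's query is compared with every word's key at once,
# so the network 'sees' the whole sentence in a single step.
scores = Q @ K.T / np.sqrt(K.shape[1])

# A softmax turns each row of scores into attention weights ...
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

# ... and each output is a blend of the words attended to.
attended = weights @ V
print(attended.shape)  # (5, 8): one context-aware vector per word
```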

Transformer-based neural networks are able to capture the context linking input elements to each other and to previously processed data. The structures they build start to resemble the hand-crafted, static semantic networks of the knowledge-management era of AI – but this time they are built afresh for each input and can self-adapt as more information becomes known.

Generative AI systems harness transformer-based neural networks to decide what is semantically important about the concepts in text, speech or image input, and to create novel outputs based on those semantics. GenAI systems have proved highly effective at integrating large amounts of input, giving attention to the key elements within it, and generating human-like content – they mimic human abilities very well.