Suppose you’d like to classify individual documents at multiple levels of specificity. In addition, you’d also like to know whether a document contains multiple topics and with what confidence. For example, as I write this Google News is displaying an article titled Income Stocks With A Trump Tax Bonus. We may want to capture the main topics contained in the article along with an associated measure of our confidence that those topics are contained in the article. Such a classification might look something like the following

1. Finance: 70%
a. Stock Market: 63% (of the 70%)
b. Taxes: 37% (of the 70%)
2. Politics: 30%
a. Executive Branch:100%
The above topical classification leads to a number of questions:
  • How do we decide when to include a given category in our list?
  • How do we decide whether we should go down another level in a given category?
  • How can we train a system to be both general enough to pick out a meaningful category from a broad variety of documents and specialized enough to potentially dive deeper into each of these categories?

We need to find a way to efficiently represent and learn a hierarchy of knowledge.


Big Idea: Hierarchical Attention Networks

One of the primary advantages of neural networks is their ability to automatically learn features in the data that are important for making accurate predictions. One neural network component that is dominating in natural language processing tasks is the bidirectional LSTM with attention (BLSTM-A). It is also common to use a GRU in place of the LSTM, as GRUs have been shown to perform comparably to LSTMs and they are faster to train.  

CMU and Microsoft Research released a paper in 2016 titled “Hierarchical Attention Networks for Document Classification,” which proposed a clever way of combining BGRU-A components to learn document representations. The authors observed that documents are composed of sentences, which are themselves composed of words, and wanted to encode this inherent compositionality into the design of their neural network architecture. They did this by applying a BGRU-A, the word encoder, over the word tokens for each sentence in a given document, resulting in a vector representation of each sentence. The sequence of sentence vectors are then fed into another BGRU-A, the sentence encoder, to obtain a single vector representation of the document. The figure below illustrates this at a high level for a document containing two simple sentences. Color denotes identity—the same word encoder component is used for processing the word tokens in sentences one and two.


Figure 1. Document Processing Stages. High-level illustration showing the main processing stages for a document both before and within the network. As a preprocessing step, documents are segmented by sentence, and then by words, before being fed to the network. The same word encoder (BLSTM-A) is applied to each of the tokenized sentences. The vector representation of each sentence is then fed to a sentence encoder (BLSTM-A), whose outputs can be projected to obtain the logits and eventual prediction.

Hierarchical Attention Networks at Forge.AI

At Forge, we have two types of hierarchies:

  1. Data hierarchy: The labels for our classification tasks exist in a pre-defined hierarchy.
  2. Model hierarchy: The architecture of the model is itself hierarchical.


Data Hierarchy

Our training data consists of documents organized into a directory hierarchy, where the name of a given directory defines a label for all documents beneath it. Since we have a nested directory structure, this means a document may have more than a single label. Specifically, the set of possible labels for a given document consists of the directory names along the path from the root directory to the location of the document.

We associate a unique model with each directory in the data hierarchy, including the root directory. The only exceptions are directories that do not contain at least two subdirectories, since we have no reason to train a model on one or zero labels. An example directory hierarchy is shown below.


Figure 2. Organiztion of Training Data. Example illustration of how the training data is organized into a label hierarchy. A unique model is training on each node of the data hierarchy. The three shaded regions shown here would correspond to three models: one trained on all documents, one on all Finance documents, and one on all Earnings documents, respectively. This gives us a tree of models, each of which are specialized in their own region of the data hierarchy.


In this example, the root-level model is trained to label a document as either Finance, Politics, or one of the omitted remaining labels (denoted by the ellipsis). Depending on the level of confidence for the prediction, we may want to pass the document to models in the next level. If the root-level model predicts the document to belong under Finance, we can send that document to the model trained exclusively on documents within Finance, which will output either Dividends, Earnings, or Finance_Other. Beneath every category is a special label reserved for documents that could not be placed neatly into any further subcategories. For example, the documents placed directly under Finance in the illustration above would be labeled as Finance_Other, where we prepend the parent name to enforce the constraint that no two categories have the same name. We can continue this process down the prediction path until our stopping condition is met (e.g. receiving a prediction confidence lower than 80 percent).

We can explore multi-topic documents by traversing, for example, the prediction paths for the top two predicted labels. Another option may be traversing down into any subcategory that receives a prediction confidence above some predefined threshold. Note that if we want to increase the average number of paths taken, we must decrease our confidence threshold. For example, if our threshold is above 0.5, we will never traverse down more than one path. This illustrates why one must be careful when interpreting the output confidence values, which are just softmax probabilities, of a network being fed documents that may very well contain more than one prominent topic.

Consider the ideal case where a model has been trained to accurately identify when two topics, topic A and topic B, occur in a document. If we feed the model a document whose contents are evenly split between topic A and topic B, how should we interpret the set of output predictions “topic A: 50%, topic B: 50%”? In this particular example, one could argue that the network is not confused and/or maximally uncertain about the topic to assign, but rather that it has observed signals of topic A and topic B in the same proportions and is communicating that accurately in its outputs. The problem, however, is that it’s not possible in a general setting to determine whether a model is uncertain about which label to assign or whether it is relatively certain that a handful of topics are present.


Model Hierarchy

Earlier, we provided a high-level overview of the BLSTM-A component and how it is used in the hierarchical attention network. Here, we describe the full model architecture in more detail.  Below is a diagram of a BLSTM-A being used at the word-encoder  level. This is also the unrolled/unfolded representation where, instead of showing the self-recurrent LSTM loop pointing back on itself, identical copies of the network are shown at each step along the input sequence. Again, color denotes identity, so any components with the same color are exact copies of each other, shared weights and all.


Figure 3. High-level illustration of BLSTM-A at the word encoder level. Each output of the BLSTM is fed through a fully-connected layer with tanh activation function, which also projects the outputs to the same dimensionality as the word importance vector. The inner product between the word importance vector and each output is sent through a softmax layer, which outputs the normalized (across timesteps) attention weights. The sentence vector is then computed as the weighted sum of the BLSTM outputs.


The bidirectional LSTM gives us context-aware word representations and the attention layer outputs a weighted sum of these representations to give a sentence representation. The attention mechanism operates by scoring each element of the BLSTM output sequence with a word importance vector. The word importance vector takes the place of what’s often called the query in traditional attention mechanisms. The goal is to learn a representation for the importance vector such that its inner product with a context-aware word vector (output from the BLSTM) yields some measure of the word’s importance. Since the importance vector is trained jointly with the rest of the model parameters, the meaning of a given word’s importance is defined implicitly through the training taskimportant words are ones that appear predominantly for some particular label. Formally, the context-aware vector (BLSTM output state) for the  tth word in the ith sentence is denoted by hit and



$$u_{it} = \tanh(W_w h_{it} + b_w) %1$$



$$\alpha_{it} = \frac{ \exp(u_{it}^T u_w) }{ \sum_{t'}  \exp(u_{it'}^T u_w)   }%1$$



$$s_i = \sum_t \alpha_{it} h_{it}  %1$$



with si being the sentence vector representation output from BLSTM-A. This results in a sentence vector for each sentence in the document. The same process is then run over these vectors with a different BLSTM-A, called the sentence encoder (just to distinguish it from the word encoder), scoring the importance of each sentence relative to one another. Similar to word importance, important sentences are those containing a particular pattern of words that appear predominantly in documents of a particular label. This reduces the chances of our model getting thrown off by, say, some politically-charged word occurring in a document that’s otherwise about home cooking. If an entire portion of the document goes off on a political rant, however, this is good reason for the model to increase its relative confidence that the document may be labeled under Politics.

Finally, it’s worth noting that detection of important words/sentences occurs before prediction of the document label, something also emphasized by the authors in the original paper. It can be easy to accidentally flip the direction of information flow when we think about these models, which is why the aforementioned distinction is made. Rather than using the label to figure out the important words, the model is first finding objectively informative word patterns that will then be used when determining the label downstream. We know that mini-batch gradient descent updates the model weights by first averaging the gradients for each example in the batch. The effect of this average, and the incremental updates across batch averages, is that we learn representations for the importance vectors that give high scores to recurring patterns in the data that would have been generally useful for determining the document label. We average out noisy signals and compound important ones. Indeed, the representations we observe for the importance vector can take on very interesting values, as seen in the example below.


Figure 4. Classifier Importance Vectors. The left image is the word importance vector and the right image is the sentence important vector. The horizontal axis corresponds to the values of the vector elements and the axis going into the page is the training step, so we are seeing the distribution of values in the vector in  fifty step intervals. For more plots showing the classifier performance as a function of training steps, please see the Appendix at the end.


Implementation Notes

In contrast with the original paper, we initialize word embeddings to pretrained GloVe embeddings instead of training them from scratch. The pretrained embeddings were trained by the Stanford NLP Group on six billion tokens from Wikipedia and Gigaword. We found this both increased performance and decreased training time.

Individual documents are converted into a matrix representation such that the entry in row i and column j corresponds to the jth word of the ith sentence in the document. Zero-padding is inserted such that the matrix has a number of columns equal to the number of words in the longest sentence. Note, however, that we are flexible with respect to number of sentences and words in a batch of documents, allowing for variable-length number of sentences and lengths of longest sentence per-document. This is accomplished by dynamically unfolding the sequences with TensorFlow’s tf.while_loop function, in conjunction with dynamic tf.TensorArray objects.

In order to support fast training times and flexibility during serving, the model consumes serialized TFRecords protobufs for training and raw tensors during serving. Such a split is possible by utilizing the notion of model signatures in TensorFlow, which allow one to define input/output modalities and associate them with unique identifiers.

The models themselves are serialized to binary protocol buffers. At serving time, these are loaded with the TensorFlow C++ API, using the minimal set of operations required to load a model into memory and feed tensors to input layers. At startup, the server loads copies of the full model tree into a thread-safe queue with one model tree for each available thread . A REST endpoint is exposed to allow clients to issue POST requests containing documents to be classified. Issuing a request triggers the request handler that will pop a model tree off the queue, query it for a hierarchy of predictions, and enqueue the model tree back into the queue when finished.


Attention Visualization of Important Words

For any prediction, we can also extract the words for which the attention layer coefficients were largest. Large coefficients correspond to words and sentences with relatively high scores as determined by the importance vectors. The importance of each word is illustrated by the shade of its corresponding square. The darker the shade, the more important. The model predicted this document to belong under Finance → Earnings with a confidence above 99.99 percent. Note that we are only displaying a portion of the document and thus a portion of the scores, so the absolute values of the scores are not importantthe value of one score relative to another is the focus for the visualization:


classifier-post-wc ex2.png
Figure 5. Word Importance Values for Two Sentences. These examples are showing the top words from two sentences contained in the article, “Alphabet Earnings: What to Watch”.





For purposes of quick illustration, I ran the model on  my laptop on a small dataset containing 2899 documents total, using 80 percent for training (orange) and 20 percent for evaluation (blue) (chosen at random). The entire training process took less than 30 minutes before some early stopping metrics determined a plateau in validation accuracy had been reached. The main hyperparameters include:
  • Batch size: 64
  • Embedding size: 300
  • GRU state size: 256
  • Vocab size: 60000
  • Max sentences considered per document: 60
  • Max number of word tokens to consider per sentence: 80





Per-Class Accuracies


Tags: NLP, Technology, Machine Learning, Data, NLU, AI, hierarchical classifier, attention network, neural network