Rule-based ethos mining
The architecture of the software system for mining ethos consists of three stages, five layers and eight components. The three stages are the ESE/Non-ESE stage, the +/- ESE stage and the Network stage. The ESE/Non-ESE stage takes cleaned text transcripts as input and classifies each segment as either an ESE or a non-ESE. The +/- ESE stage then assigns a polarity to each ESE: ESEs with positive sentiment correspond to ethos supports, and ESEs with negative sentiment correspond to ethos attacks. Finally, the Network stage provides a visualisation of all ESEs as edges between the participants in the debate.
In the ESE/Non-ESE stage, there are three layers consisting of five components. The parsing layer takes plain text from the COMMA Corpus test sub-corpus and applies three methods to it: Named Entity Recognition (NER), Part-Of-Speech (POS) tagging and a set of domain-specific rules. The output is a set of Agent Reference Expressions (AREs), which are any statements referring to another person, organisation or agentive entity. Given the dialogical nature of the material, many statements do not refer to the target-person explicitly by name but, for example, by a pronoun (e.g. “she”), by the region an MP represents (e.g. “the honourable Member for Falkirk, East”), by an etiquette formula (e.g. “my honourable friend”), or by a functional role (e.g. “the Prime Minister”). AREs are therefore passed to the anaphora layer, where both the source-person and the target-person of the statement are retrieved from the original text. The next challenge is that repetitions of what has previously been said can be ethotically neutral, especially when an MP wants to recall a thread of the debate from many turns earlier. Full AREs are therefore passed to the reported speech layer, where an ARE is removed if it is not an ethotic expression but an instance of reported speech.
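The parsing and reported speech layers can be illustrated with a minimal Python sketch. It is not the system's implementation: the regular expressions below are hypothetical stand-ins for the domain-specific rules, NER and POS tagging are omitted, and the anaphora layer is left out entirely. The names ARE_PATTERNS, REPORTED_SPEECH, is_are, is_reported_speech and ese_candidates are assumptions introduced only for this illustration.

```python
import re

# Hypothetical rule patterns standing in for the domain-specific rules;
# the actual rule set used by the system is not reproduced here.
ARE_PATTERNS = [
    re.compile(r"\b(?:he|she|him|her)\b", re.IGNORECASE),           # pronoun
    re.compile(r"\bhonourable member for \w+", re.IGNORECASE),      # region an MP represents
    re.compile(r"\bhonourable (?:friend|gentleman|lady)\b", re.IGNORECASE),  # etiquette formula
    re.compile(r"\bthe prime minister\b", re.IGNORECASE),           # functional role
]

# Hypothetical cue for reported speech (assumption: a reporting verb followed by "that").
REPORTED_SPEECH = re.compile(r"\b(?:said|stated|told us|claimed) that\b", re.IGNORECASE)


def is_are(segment: str) -> bool:
    """Parsing layer (simplified): does the segment refer to another agent?"""
    return any(p.search(segment) for p in ARE_PATTERNS)


def is_reported_speech(segment: str) -> bool:
    """Reported speech layer (simplified): does the segment repeat earlier talk?"""
    return bool(REPORTED_SPEECH.search(segment))


def ese_candidates(segments: list) -> list:
    """Keep segments that are AREs and are not instances of reported speech."""
    return [s for s in segments if is_are(s) and not is_reported_speech(s)]


segments = [
    "My honourable friend has shown great courage.",
    "The Prime Minister said that taxes would fall.",
    "The budget deficit rose last year.",
    "She has misled the House.",
]
kept = ese_candidates(segments)
```

In this toy run, the second segment is dropped as reported speech and the third because it contains no agent reference, so only the first and fourth segments remain as ESE candidates.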
In the +/- ESE stage there is one layer, the sentiment layer, containing two components: the sentiment classifier and the word lexicons. These two components combine to classify ESEs as positive or negative. The two resulting sets are then passed to the Network stage, where the visualisation layer displays relationships between people, organisations and other entities. The techniques of domain-specific rules, anaphora resolution, reported speech function and relationship visualisation were developed specifically for the tasks of ethos mining in political debate, and the method of sentiment classification was extended with the development of a lexicon to account for the characteristics of the domain.
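As a rough illustration of how a lexicon can drive polarity assignment, the following Python sketch counts positive and negative lexicon hits in an ESE. The word sets are tiny hypothetical examples rather than entries taken from the actual lexicons, and the simple counting rule is an assumption: the text does not specify how the classifier and the lexicons are combined.

```python
import re

# Tiny hypothetical lexicons for illustration only.
POSITIVE_WORDS = {"courage", "honest", "excellent"}
NEGATIVE_WORDS = {"misled", "failed", "incompetent"}


def polarity(ese: str):
    """Return "+" (ethos support), "-" (ethos attack), or None when undecided.

    Assumption: polarity is the sign of (positive hits - negative hits).
    """
    tokens = re.findall(r"[a-z']+", ese.lower())
    score = sum(t in POSITIVE_WORDS for t in tokens) - sum(t in NEGATIVE_WORDS for t in tokens)
    if score > 0:
        return "+"
    if score < 0:
        return "-"
    return None


support = polarity("My honourable friend has shown great courage.")
attack = polarity("She has misled the House.")
```

Under these toy lexicons, the first ESE is labelled as a support and the second as an attack; an ESE with no lexicon hits is left undecided rather than forced into a class.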
To perform sentiment analysis, one existing lexicon was used, the sentiment word lexicon (SWL), and one lexicon was created, an ethotic word lexicon (EWL). The SWL contains 2,006 words tagged as positive and 4,738 words tagged as negative. The EWL is a set of keywords developed using the Thatcher’s Ethos in Hansard corpus training sub-corpus, which contains 381 tagged sentences (96 positive and 285 negative), from which unigrams, bigrams and trigrams were extracted. Despite the relatively small volume of this set, its advantage lies in its adaptation to sentiment related specifically to ethos in political debate. The removal of non-sentiment-bearing words and named entities, together with the use of n-grams, gave 32,858 features overall to be used as training data for machine learning.
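The extraction of unigrams, bigrams and trigrams can be sketched in Python as follows. The stopword set is a hypothetical stand-in for the non-sentiment-bearing words, named-entity removal is omitted, and filtering before n-gram construction is an assumption about the order of operations; the function names ngrams and ngram_features are likewise introduced only for this illustration.

```python
import re


def ngrams(tokens: list, n: int) -> list:
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(sentence: str, stopwords: set) -> list:
    """Tokenise, drop hypothetical non-sentiment-bearing words, then collect 1- to 3-grams."""
    tokens = [t for t in re.findall(r"[a-z']+", sentence.lower()) if t not in stopwords]
    features = []
    for n in (1, 2, 3):
        features.extend(ngrams(tokens, n))
    return features


# Hypothetical stopword set, for illustration only.
features = ngram_features("She has misled the House", {"has", "the"})
```

For this sentence the remaining tokens are "she", "misled" and "house", giving three unigrams, two bigrams and one trigram.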
Deep learning ethos mining
To extract ethos, we created an NLP pipeline that takes raw natural language text as input and outputs +/-ESEs, which are then used in the ethos analytics tool. The pipeline consists of three kinds of modules: those employing existing techniques, those extending such techniques for the purpose of ethos mining, and those containing original techniques developed specifically for this paper. The raw text is passed to five areas of the pipeline (see arrows coming top-down from raw text): (1) directly to the DMRNN module; (2) the POS tagger; (3) the universal dependency (UD) tagger; (4) the sentiment classifier; and (5) the anaphora resolution module. Modules (3) and (4) are involved in complex processes. The UD tags are passed to the entity extraction (EXT) module (which removes entity references not relevant for ethos mining) and then to the sentiment presence module (which determines whether a sentence contains a sentiment). The output from the sentiment classifier is passed to the polarity (POL) module, which combines the output from the sentiment presence module with the sentiment classifications. Next, the raw text, POS tags, UD tags, EXT output and POL tags are passed as separate inputs into the DMRNN, which returns ESEs/n-ESEs. The output of sentiment classification then determines +/-ESEs, with the anaphora resolution module tagging each +/-ESE with a source and target.
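The data flow described above can be summarised in a short Python sketch. It only wires the modules together: every module is passed in as a callable, and the stubs used at the end are placeholders, not the real taggers, classifier or DMRNN. The names Modules and run_pipeline, the callable signatures, and the way POL combines sentiment presence with the sentiment label are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Modules:
    """Hypothetical container for the pipeline modules (all signatures are assumptions)."""
    pos_tagger: Callable[[str], List[str]]
    ud_tagger: Callable[[str], List[str]]
    sentiment: Callable[[str], str]                  # returns "+" or "-"
    anaphora: Callable[[str], Tuple[str, str]]       # returns (source, target)
    ext: Callable[[List[str]], List[str]]            # entity extraction over UD tags
    sentiment_presence: Callable[[List[str]], bool]  # does the sentence carry sentiment?
    dmrnn: Callable[[str, List[str], List[str], List[str], Optional[str]], bool]


def run_pipeline(text: str, m: Modules) -> Optional[dict]:
    """Return a +/-ESE record with source and target, or None for a non-ESE."""
    pos = m.pos_tagger(text)
    ud = m.ud_tagger(text)
    label = m.sentiment(text)
    ext = m.ext(ud)
    present = m.sentiment_presence(ext)
    # POL (assumed combination): keep the sentiment label only if sentiment is present.
    pol = label if present else None
    # The DMRNN receives the five inputs separately and decides ESE vs. non-ESE.
    if not m.dmrnn(text, pos, ud, ext, pol):
        return None
    source, target = m.anaphora(text)
    return {"polarity": label, "source": source, "target": target}


# Placeholder stubs so the sketch runs end to end; they perform no real analysis.
stubs = Modules(
    pos_tagger=lambda t: ["TAG"] * len(t.split()),
    ud_tagger=lambda t: ["dep"] * len(t.split()),
    sentiment=lambda t: "-" if "misled" in t else "+",
    anaphora=lambda t: ("Speaker A", "Speaker B"),
    ext=lambda ud: ud,
    sentiment_presence=lambda ext: len(ext) > 0,
    dmrnn=lambda text, pos, ud, ext, pol: pol is not None,
)
result = run_pipeline("She has misled the House.", stubs)
```

Passing the modules in as callables keeps the wiring separate from the models themselves, so any stub can be replaced by a real tagger or classifier without changing run_pipeline.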
For the period from February 1st 1997 to April 30th 1997, 53 text transcripts were analysed, focusing on the final stages of the Conservative government before Labour leader Tony Blair became Prime Minister. During this time, it was documented that John Major, the then Prime Minister, was struggling to keep his own party on side. This is evident in the analysis, with eight ethotic attacks coming from his own party and two attacks coming from Tony Blair, the leader of the opposition at the time, where the average number of attacks is two. Following the loss of the general election to the Labour party, a new leader of the Conservatives was elected. Interestingly, in the lead-up to the general election, the proposed candidates for the Conservative leadership election are more prominent in the visualisation, where the mean number of supports and attacks for a politician is two. The many supports and attacks involving the potential leaders hint at their impending desire to run for the party leadership, as a high number of either shows that they are more prominent in debate.