Version 22

    The Newcastle Upon Tyne, UK office of Red Hat is predominantly a middleware engineering centre, with core developers from several middleware project teams. In addition to projects in those areas, the following topics are available to local candidates only. All work will be open source.

     

    Notes on application procedure

     

    Candidates should select one preferred project and optionally one reserve choice in which they have particular interest, in order that we can ensure their application is handled by the most appropriate software engineering staff. Notification of the selected projects should be sent by email, along with a C.V. and some source code. All these projects require competence in software engineering, so a body of work that demonstrates design, coding and testing skills is preferred. For postgraduate level candidates, the undergraduate final year project work may be most appropriate. Contributions to existing open source projects will also be considered favorably.  The provided source should be in Java, unless the selected project specifically calls for work in other languages.  The submission will be reviewed by the potential project supervisors and candidates whose work meets the required standard will be invited to a face to face technical interview.

     

    Addendum for 2016/17 onwards: in addition to topics below, candidates may submit their own proposals in the same format i.e. a brief outline of the problem to be addressed, the reason why it's interesting and, most importantly, an explanation of why it's useful to Red Hat.

     

    Notes on project design

     

    The topics below consist, for the most part, of software engineering tasks taken from the 'to do' list of JBoss R&D. That is, they are tasks with direct relevance to the Red Hat middleware software portfolio and likely to be undertaken by JBoss staff as time permits, unless first done by a student. They will be supervised by the software engineer who would otherwise do the work and who therefore has both a good grasp of the field and a vested interest in successful completion of the work.

     

    In contrast to some other industrial placements, most of these project topics require some element of original research. In such cases, students will be expected to become familiar with the state of the art in the relevant field and to contribute to it, whilst producing implementation work to the standard expected of a research prototype. Such projects may be viewed as having the same form and quality standards as a PhD, with much reduced scope to fit the allocated timescale.

     

    Some projects may alternatively focus on software engineering discipline in preference to research, requiring students to produce code built and tested to production quality. These may require delivering features into a live project release on a fixed timeline.  Finally, projects may focus on the practice of community open source, working in an open, agile and collaborative manner. Soft skills and non-code contributions may form a larger part of the assessment in such cases.

     

    And now, the project list. The topics below are offered for academic year 2017/18.

     

    Persistent Memory for Java Middleware

     

    Persistent Memory retains data without power (like HDD/SSD storage), but is byte addressable (like RAM). It thus presents a new programming model, with novel characteristics and benefits that are not yet fully exploited. See e.g. https://www.snia.org/PM and https://www.usenix.org/system/files/login/articles/login_summer17_07_rudoff.pdf for background. Whilst some libraries are starting to emerge to provide higher level APIs to these low-level hardware features, see e.g. pmem.io: PMDK, the options for accessing this functionality from Java are still limited. For JavaEE middleware use cases that require fault tolerance, e.g. XA transaction logs, messaging logs, in-memory data grids and databases, PM represents a key opportunity for performance gains.

     

    In this project you will explore ways to utilise persistent memory from Java middleware such as the Narayana transaction engine, Apache ActiveMQ-Artemis messaging system and Infinispan data grid. In addition to Java, some grasp of C will likely be necessary for this project. Experience of JNI, hardware architecture, Linux system programming, profiling and benchmarking may be beneficial.
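
    As a rough illustration of the byte addressable programming model, the sketch below keeps an append-only log in a memory-mapped file using only the standard java.nio API. It is a stand-in rather than a persistent memory implementation: the class name MappedLog, the length-prefixed record layout and the idea of placing the file on a persistent-memory-backed filesystem are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical append-only log held in a memory-mapped region.
// Record layout (an assumption): 4-byte length followed by UTF-8 bytes.
final class MappedLog {
    private final MappedByteBuffer buffer;

    MappedLog(Path file, int capacity) {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // The mapping remains valid after the channel is closed.
            buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, capacity);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Recovery: skip over records written by an earlier run.
        while (buffer.remaining() >= Integer.BYTES) {
            int length = buffer.getInt(buffer.position());
            if (length <= 0 || length > buffer.remaining() - Integer.BYTES) {
                break;
            }
            buffer.position(buffer.position() + Integer.BYTES + length);
        }
    }

    // Appends one record and asks the OS to flush it to the backing store.
    void append(String record) {
        byte[] bytes = record.getBytes(StandardCharsets.UTF_8);
        if (buffer.remaining() < Integer.BYTES + bytes.length) {
            throw new IllegalStateException("log is full");
        }
        buffer.putInt(bytes.length);
        buffer.put(bytes);
        buffer.force();
    }

    // Reads records from the start until a zero length marks unused space.
    List<String> readAll() {
        List<String> records = new ArrayList<>();
        ByteBuffer view = buffer.duplicate();
        view.position(0);
        while (view.remaining() >= Integer.BYTES) {
            int length = view.getInt();
            if (length <= 0 || length > view.remaining()) {
                break;
            }
            byte[] bytes = new byte[length];
            view.get(bytes);
            records.add(new String(bytes, StandardCharsets.UTF_8));
        }
        return records;
    }
}
```

    Opening the same file a second time and calling readAll() returns the records appended earlier, which is the property a transaction or messaging log depends on. The sketch does not guard against torn writes; a real log would need a checksum or commit marker per record.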

     

     

    GraphQL to Hibernate query adapter

     

    http://graphql.org provides a potentially useful advance over traditional REST API designs, particularly with regard to reducing the number of latency-inducing remote calls a client must make to the server. However, this comes at the cost of greater complexity in implementing an efficient query plan for executing the GraphQL statements against the backend storage. Existing mechanisms e.g. https://github.com/jcrygier/graphql-jpa take a relatively straightforward approach to query translation, with little attention to performance costs.

     

    In this project you will analyse and benchmark existing approaches to executing GraphQL using Hibernate, potentially encompassing both traditional relational database backends and NoSQL db alternatives via HibernateOGM. You will provide best practice design guidelines for users, as well as prototyping new query execution strategies to address any performance issues you identify. Prior experience of GraphQL, REST, Hibernate and relational or NoSQL databases would be beneficial.

     

     

    Web protocol proxy

     

    Proxy servers provide a way to intercept and mutate client/server communication. For web applications this is useful for performance (e.g. caching), security (e.g. blocking malicious content) and service provision (e.g. content translation, ad blocking). Extensible HTTP proxy servers for developing such capabilities in Java already exist, e.g. LittleProxy, but many modern web apps make use of additional communication features such as WebSockets or https://socket.io/ based data streams.

     

    In this project you will develop a Java proxy framework capable of managing WebSocket frames and/or socket.io protocol messages. This will layer on the Netty.IO async I/O framework. Some prior knowledge of asynchronous I/O, network protocols and state machines would be beneficial.
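
    To give a feel for what managing WebSocket frames involves, here is a minimal sketch that decodes a single frame following the base framing layout of RFC 6455: FIN flag and opcode in the first byte, mask flag and payload length in the second, optional extended length, optional 4-byte masking key, then the payload. The class name WebSocketFrame is hypothetical, and the sketch works on a plain ByteBuffer rather than on Netty types. It ignores fragmentation, extensions and control-frame rules.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical holder for one decoded WebSocket frame.
final class WebSocketFrame {
    final boolean fin;     // true if this is the final fragment of a message
    final int opcode;      // 0x1 = text, 0x2 = binary, 0x8 = close, ...
    final byte[] payload;  // payload with any client masking removed

    private WebSocketFrame(boolean fin, int opcode, byte[] payload) {
        this.fin = fin;
        this.opcode = opcode;
        this.payload = payload;
    }

    // Decodes one complete frame from the buffer (network byte order).
    static WebSocketFrame decode(ByteBuffer in) {
        int b0 = in.get() & 0xFF;
        int b1 = in.get() & 0xFF;
        boolean fin = (b0 & 0x80) != 0;
        int opcode = b0 & 0x0F;
        boolean masked = (b1 & 0x80) != 0;

        long length = b1 & 0x7F;
        if (length == 126) {
            length = in.getShort() & 0xFFFF;  // 16-bit extended length
        } else if (length == 127) {
            length = in.getLong();            // 64-bit extended length
        }

        byte[] mask = new byte[4];
        if (masked) {
            in.get(mask);
        }
        if (length < 0 || length > in.remaining()) {
            throw new IllegalArgumentException("incomplete or oversized frame");
        }

        byte[] payload = new byte[(int) length];
        in.get(payload);
        if (masked) {
            // Unmask by XOR with the key, cycling through its 4 bytes.
            for (int i = 0; i < payload.length; i++) {
                payload[i] ^= mask[i % 4];
            }
        }
        return new WebSocketFrame(fin, opcode, payload);
    }

    String text() {
        return new String(payload, StandardCharsets.UTF_8);
    }
}
```

    Feeding it the masked single-frame text example from RFC 6455 (bytes 81 85 37 fa 21 3d 7f 9f 4d 51 58) yields a final text frame whose payload is "Hello". A proxy would apply its inspection or mutation logic to the decoded payload and then re-encode the frame before forwarding it.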

     

     

    Graph assisted maven artifact usage analytics

     

    The Apache Maven build and dependency management tool is widely used by Java developers and is the basis of code packaging and distribution for many open source projects, including most of those led by JBoss engineers. Analysis of maven repository server logs can therefore give some indication of community uptake of new software releases, providing valuable feedback on adoption to developers. Simple counting techniques can be useful, e.g. number of users of each version over time can indicate how quickly users migrate to new feature/bugfix releases. However, such methods are limited as they fail to account for the dependency relationships between artifacts.

     

    For traffic logs showing a user who requests artifact A, then artifact B soon after, where A is known to declare a dependency upon B, we may infer that the request for B was transitive via A, rather than a direct choice by the user. We may thus present a report with metrics showing how much of a project's usage is attributable to its inclusion via dependencies in various other projects.  This requires building the dependency graph for all artifacts from the maven metadata and storing it in a manner that provides for efficient query during the analysis of the server traffic logs.
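
    The inference described above can be sketched in a few lines. The example below is only an illustration under stated assumptions: requests arrive sorted by time, the dependency graph is a simple in-memory map of directly declared dependencies, and "soon after" is modelled as a fixed time window. The class, field and parameter names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical classifier marking requests as transitive or direct.
final class TransitiveRequestClassifier {

    // One line of the repository traffic log, reduced to what we need.
    static final class Request {
        final String user;
        final String artifact;
        final long timeMillis;

        Request(String user, String artifact, long timeMillis) {
            this.user = user;
            this.artifact = artifact;
            this.timeMillis = timeMillis;
        }
    }

    // For each request (sorted by time), returns true if the same user
    // requested, within the window beforehand, an artifact that declares
    // a dependency on the requested one.
    static List<Boolean> classify(List<Request> requests,
                                  Map<String, Set<String>> declaredDeps,
                                  long windowMillis) {
        List<Boolean> transitive = new ArrayList<>();
        for (int i = 0; i < requests.size(); i++) {
            Request current = requests.get(i);
            boolean explained = false;
            for (int j = i - 1; j >= 0 && !explained; j--) {
                Request earlier = requests.get(j);
                if (current.timeMillis - earlier.timeMillis > windowMillis) {
                    break;  // everything before this point is too old
                }
                explained = earlier.user.equals(current.user)
                        && declaredDeps
                            .getOrDefault(earlier.artifact, Collections.emptySet())
                            .contains(current.artifact);
            }
            transitive.add(explained);
        }
        return transitive;
    }
}
```

    The backwards scan is quadratic in the worst case and only follows directly declared dependencies. At realistic log volumes the lookup would move into the graph store and the per-user windowing into the analysis platform.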

     

    In this project you will build a system to facilitate reporting on project popularity by analysis of maven repository server traffic logs, focusing on utilisation of dependency graph information. Prior experience of maven, a suitable analysis platform (e.g.  Apache Spark) and a graph database (e.g. Neo4j) would be beneficial.

     

     

    Wrapper Induction for server log analytics

     

    Server logs are an important source of information for understanding system behavior in support cases. Such logs comprise statements generated from string templates in the server and application code, but these templates are not always available at analysis time. Therefore it is frequently necessary to use relatively primitive pattern matching tools to extract the useful nuggets of information from log files. Using machine learning techniques to extract the patterns, i.e. to infer the grammar of the underlying log system, would allow for more efficient data extraction and filtering.

     

    In this project you will study wrapper induction techniques and apply them to build a tool for analysis of server log files.  Some knowledge of language grammars, regular expressions and machine learning techniques may be beneficial.
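
    A deliberately simple sketch of the idea follows. It is not a full wrapper induction algorithm: it assumes the input lines were all produced by the same template and have the same number of whitespace-separated tokens, keeps the tokens that never change and replaces the rest with a wildcard. The class name LogTemplateInducer and the "<*>" marker are choices made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical position-wise template inference for log lines.
final class LogTemplateInducer {
    static final String WILDCARD = "<*>";

    // Returns the template tokens: constants kept, variables as WILDCARD.
    static List<String> inferTemplate(List<String> lines) {
        String[] template = lines.get(0).trim().split("\\s+");
        for (String line : lines.subList(1, lines.size())) {
            String[] tokens = line.trim().split("\\s+");
            if (tokens.length != template.length) {
                throw new IllegalArgumentException("lines differ in token count");
            }
            for (int i = 0; i < tokens.length; i++) {
                if (!tokens[i].equals(template[i])) {
                    template[i] = WILDCARD;  // position varies between lines
                }
            }
        }
        return new ArrayList<>(Arrays.asList(template));
    }

    // Turns a template into a regex with one capture group per variable.
    static Pattern toPattern(List<String> template) {
        StringBuilder regex = new StringBuilder();
        for (String token : template) {
            if (regex.length() > 0) {
                regex.append("\\s+");
            }
            regex.append(token.equals(WILDCARD) ? "(\\S+)" : Pattern.quote(token));
        }
        return Pattern.compile(regex.toString());
    }
}
```

    Given two made-up lines such as "Connection from 10.0.0.1 closed after 35 ms" and "Connection from 10.0.0.7 closed after 120 ms", the inferred template is "Connection from <*> closed after <*> ms", and the derived pattern extracts the address and duration from further lines of the same shape. Real logs would first need clustering into groups of lines that share a template, which is where the machine learning techniques come in.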

     

     

    CDC for time traveling queries

     

    Database systems hold current state, allow queries over it and allow modifications to it. CDC (change data capture) systems present those modifications as a time ordered series of events.  By replaying the CDC event stream forward from a known base position, it's possible to recreate the db state as it existed at any point in time and thus perform queries which produce results as they would have appeared at that instant.  Some databases which use MVCC concurrency control can achieve the same functionality from the opposite direction, selectively overwriting current state with older state to allow queries which behave as though executed in the past, see e.g. Oracle flashback.

     

    For databases that don't natively support queries over prior state, it is desirable to add this functionality through an adapter layer or driver which uses the CDC event stream to patch query results.  Whilst intractably complex for many cases e.g. joins, this is a manageable problem for a useful subset of queries, particularly those used by analytics systems to bulk read tables for later processing.

     

    In this project you will prototype a database adapter that can execute SQL queries which return results modified by reversed CDC events such that they reflect database state at a specified point in the past.  Some knowledge of SQL, JDBC, CDC and related technologies would be beneficial. This project may involve exposure to the internals of some Red Hat projects, potentially including Debezium, Teiid and Hibernate.
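
    The core of the reversed-event approach can be sketched on an in-memory table. The example assumes a hypothetical event shape carrying the row key plus "before" and "after" images (null meaning the row did not exist), and a list of events sorted by ascending timestamp. A real adapter would apply the same patching to JDBC result sets rather than to a map.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical time travel over a single keyed table using CDC events.
final class CdcTimeTravel {

    // One change event; before/after are null when the row is absent.
    static final class ChangeEvent {
        final long timestamp;
        final int key;
        final String before;
        final String after;

        ChangeEvent(long timestamp, int key, String before, String after) {
            this.timestamp = timestamp;
            this.key = key;
            this.before = before;
            this.after = after;
        }
    }

    // Rebuilds the table as it was at 'asOf' by undoing, newest first,
    // every event that happened after that instant.
    static Map<Integer, String> stateAt(Map<Integer, String> current,
                                        List<ChangeEvent> events,
                                        long asOf) {
        Map<Integer, String> state = new HashMap<>(current);
        for (int i = events.size() - 1; i >= 0; i--) {
            ChangeEvent event = events.get(i);
            if (event.timestamp <= asOf) {
                break;  // older events are already reflected in the result
            }
            if (event.before == null) {
                state.remove(event.key);             // undo an insert
            } else {
                state.put(event.key, event.before);  // undo an update or delete
            }
        }
        return state;
    }
}
```

    For a bulk table read this is all that is required: the current rows are fetched once and then patched. Queries with joins or aggregates cannot be patched row by row in this way, which is why the project targets a subset of queries.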

     

     

    Optimising software build times with maven and http2

     

    HTTP/2, the latest generation of the web transport protocol, introduces performance enhancing features including server push, in which a server may send a client some resource it has not explicitly requested. This feature is used to optimize web page load times by sending dependencies (e.g. CSS stylesheets, images, JavaScript files) for an HTML page before a browser requests them. Where latency is high relative to the available bandwidth this can increase responsiveness substantially.

     

    The maven build system is widely used for dependency management in Java projects, downloading code libraries and other artifacts from a repository server over http and caching them locally. For large projects this activity may take a substantial time.  The current maven release uses the older HTTP/1.x protocol.

     

    In this project you will investigate the opportunities for improving maven build times by utilising http2 to perform the communication to the repository server. Some prior knowledge of maven and http would be beneficial.

     

     

    Robo-journalism

     

    Over countless generations of evolution, humans have become adept at absorbing information through stories. From hunter-gatherers sitting round the fire, to the printing press and internet, our interest in an informative and well told narrative has remained constant. Unfortunately so too has the labour intensive creative effort needed to craft one.  The situation is finally starting to change for highly data-rich subject domains such as finance and sports. Here event reports frequently follow a highly stylised pattern, essentially driven from a story template whose branching flow and variable substitution start to make journalism look a lot like programming. (see e.g. http://www.wired.com/gadgetlab/2012/04/can-an-algorithm-write-a-better-news-story-than-a-human-reporter/  http://www.benzinga.com/news/14/06/4672045/ap-using-robots-for-journalism-starting-in-july )
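
    To make the comparison with programming concrete, the sketch below is a tiny story template with branching and variable substitution. Everything in it is invented for illustration: the class name, the wording and the choice of inputs (a release with a bug-fix count and two weekly download figures).

```java
// Hypothetical story template for a software release report.
final class ReleaseStory {

    static String write(String project, String version, int bugsFixed,
                        int downloadsThisWeek, int downloadsLastWeek) {
        StringBuilder story = new StringBuilder();
        story.append(project).append(' ').append(version)
             .append(" was released this week, ");

        // Branch on the data so the sentence stays grammatical.
        if (bugsFixed == 0) {
            story.append("with no bug fixes");
        } else if (bugsFixed == 1) {
            story.append("fixing one bug");
        } else {
            story.append("fixing ").append(bugsFixed).append(" bugs");
        }
        story.append(". ");

        // Choose the verb from the direction of the trend.
        if (downloadsThisWeek > downloadsLastWeek) {
            story.append("Downloads rose to ").append(downloadsThisWeek)
                 .append(" from ").append(downloadsLastWeek).append('.');
        } else if (downloadsThisWeek < downloadsLastWeek) {
            story.append("Downloads fell to ").append(downloadsThisWeek)
                 .append(" from ").append(downloadsLastWeek).append('.');
        } else {
            story.append("Downloads held steady at ")
                 .append(downloadsThisWeek).append('.');
        }
        return story.toString();
    }

    public static void main(String[] args) {
        System.out.println(write("ExampleProject", "1.2.0", 3, 1500, 1200));
    }
}
```

    Even this toy shows where the difficulty lies: each new data field multiplies the branches needed to keep the prose grammatical and varied.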

     

    In this project you will investigate robo-journalism, the process of producing software generated news stories and reports, as an alternative or complement to traditional corporate reporting forms such as charts and tables.  Can data rich domains such as web traffic analysis and open source software release lifecycles be described in story form by software driven from server logs, bug tracker and version control data? Stay tuned for an exclusive report...

     

    Note that a strong grasp of English grammar is required for this project. If you don't know your adverbs from your pronouns you're probably going to struggle.

     

     

    Game oriented training for Enterprise Java APIs

     

    Programming skills may be taught, practiced or assessed using a game oriented approach e.g.

    https://www.codingame.com/ https://codecombat.com/ https://screeps.com/ https://microcorruption.com http://robowiki.net/wiki/Robocode http://starfighters.io/

     

    This approach may be useful for lowering variable costs and increasing engagement in low-touch areas such as continuing education (MOOCs, professional certification courses) and software engineering recruitment.

     

    Provision of code examples and tutorials is a widespread approach to software project documentation for new users, but provides no learning feedback. Primitive assessment methods based on multiple-choice questions are an easy-to-implement but limited approach to addressing this problem. A game oriented approach offers richer possibilities, but requires a considerably more sophisticated execution platform. Scoring and assessment of victory conditions are among the many challenges, as this may require e.g. secure execution of untrusted code and qualitative evaluation of source listings in addition to pattern matching of runtime output.

     

    Providing a gaming context in which programmers can learn to use the standard Java Enterprise APIs (e.g. JMS, EJB) and related JBoss project APIs, can receive largely automated feedback on their progress and can be monitored/mentored by interested parties, would provide a novel and powerful way of engaging with users. It may facilitate the delivery of education services to Red Hat customers, provide data to fuel analysis of API design and documentation quality, or help screen job candidates for technical roles.

     

    In this project you will consider the challenges of providing a platform for the delivery of game oriented training in the use of one or more Java Enterprise APIs, focusing on the problem of providing evaluation and feedback to users in a largely automated manner. Use of techniques such as byte code instrumentation, mocking, static analysis and virtualization may be relevant.

     

     

    Distributed application tracing and analysis

     

    Understanding the behaviour of complex software systems is challenging under the best of circumstances and even more so when the system is not directly observable and the analysis is time critical. Such is the situation faced daily by Red Hat's global support team when diagnosing problems with customer systems within SLA mandated time windows. A system's behaviour must be analyzed from the log files it produces. Since each problem is different, it's challenging for developers to know what to log. Further, logging is often reduced in production systems due to performance overhead.  The logs that are produced are human readable, but this renders them difficult to handle by automated analysis tools, as important semantic metadata can be missing or obscured. Log analysis thus frequently relies solely on primitive tooling (e.g. grep), expert knowledge of the codebase and a fair bit of intuition.  An improved solution offers the potential for faster, more cost effective resolution of customer support cases.

     

    Some disjoint pieces of the solution appear to exist.  TNT4J provides a semantically rich API for generating log statements from the code. Chronicle-logger provides a high performance way to persist this information.  Narayana Transaction Analyser provides a way to correlate, filter and visualize activity from logs, although as yet only for a specific subject domain.  Rule based systems and knowledge bases offer automated suggestions on potential solutions based on patterns found in the logs.

     

    In this project you will investigate use of these and other components to provide guidance on best-practice methods of generating and analyzing logs.

     

     

    Browser based source code navigation

     

    Developers typically have their most frequently accessed code indexed in an IDE (IntelliJ IDEA, Eclipse, etc) for easy navigation.  From time to time however, they require access to infrequently used code or legacy versions that are not locally resident. For such cases a web based code viewing and navigation solution is preferred.

     

    The popular subversion source control system comes with an apache httpd module that allows browser based viewing of the repository. However, navigation is limited to simple directory traversal and content representation is primitive, being limited to plain-text rendering of files. (e.g. http://anonsvn.jboss.org/repos/labs/labs/jbosstm/ )  Some commercial products do marginally better. Fisheye, for example, provides version (revision) based navigation and meta-data in addition to directory traversal. However, its html markup of content is still rather limited and focused on change highlighting rather than code navigation. (e.g. http://fisheye.jboss.org/browse/JBossTS/ )

     

    Contrast this to the rich, syntax-aware highlighting and navigation of source code provided by IDEs where e.g. keywords are colored, control-click on a variable navigates to the declaration point, control-click on a type navigates to its class or interface, clicking a function navigates to its source and such.

     

    Consider also the large quantity of code in other repositories: maven source .jars, github public repositories, etc. Accessing this information requires tedious additional steps (search, download, unpack, load in an IDE, ...) or proprietary interfaces that create information silos between which there is no linking - you can't click through from a github source listing to a svn repository listing for e.g. a library method call. This is unsatisfactory.

     

    In this project you will produce a web based code repository (svn or git) navigator with a) Java source code syntax awareness and b) revision awareness. It will render the repository content with appropriate styling for e.g. keyword identification and hyperlinks for e.g. variables, functions and types. It will allow navigation between versions. To do this it may reuse existing libraries for repository integration (e.g. mvn, svnkit) and parsing .java source files.

     

    Additional work may include allowing for smart cross-repository linking e.g. where project A imports classes from project B, rendering of the former should include appropriate hyperlinks to classes in the repository of the latter. This navigation should also be version aware.  Creating unified virtual repositories for e.g. a specific open source software product and all the 3rd party open source libraries bundled with it would also be a valuable contribution.

     

    Some knowledge of code parsing methods, syntax trees, symbol tables and related topics is essential. Some prior knowledge of the Java web stack (servlets/jsp), maven and one or more version control systems would be beneficial.

     

     

    Archived Projects

     

    The projects below have previously been tackled by other students and are not normally available to new candidates.

     

    Graph modeling for the analysis of change in social networks

     

    Understanding the communities of users and contributors that evolve around open source software projects is key to monitoring and managing their health. The use of a graph representation to model the social relationships is a common technique. However, it is frequently limited to a point in time snapshot of the social network's state, an approach that fails to account for the change in community over time. By adding a temporal dimension to the data modeling, it becomes possible to compare the state of the community at selected points in time.

     

    A naive method of implementation is to construct one graph for each time period of interest, but this is inefficient as the number of periods grows and makes queries cumbersome. Adding time data to the nodes and edges of a single graph is preferred, but graph query languages frequently lack temporal primitive operators and functions to facilitate its use in analysis.
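
    The single-graph approach can be sketched as follows, assuming each edge carries a half-open validity interval [validFrom, validTo) and that the point-in-time filter is done in application code because the query language lacks temporal operators. The class and method names are hypothetical, and a plain list stands in for the graph database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical temporal graph: one graph, time data on every edge.
final class TemporalGraph {

    private static final class Edge {
        final String from;
        final String to;
        final long validFrom;  // inclusive
        final long validTo;    // exclusive

        Edge(String from, String to, long validFrom, long validTo) {
            this.from = from;
            this.to = to;
            this.validFrom = validFrom;
            this.validTo = validTo;
        }
    }

    private final List<Edge> edges = new ArrayList<>();

    void addEdge(String from, String to, long validFrom, long validTo) {
        edges.add(new Edge(from, to, validFrom, validTo));
    }

    // Neighbours of 'node' in the snapshot of the graph at instant 't'.
    Set<String> neighboursAt(String node, long t) {
        Set<String> result = new TreeSet<>();
        for (Edge edge : edges) {
            if (edge.from.equals(node) && edge.validFrom <= t && t < edge.validTo) {
                result.add(edge.to);
            }
        }
        return result;
    }
}
```

    Comparing neighboursAt for the same node at two instants shows how its relationships changed, without building a separate graph per period. The linear scan is the part a real system would replace with indexed temporal predicates.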

     

    In this project you will investigate techniques for efficiently representing and querying a dataset from a large online community, e.g. the stackexchange data dump (https://archive.org/details/stackexchange), with emphasis on modeling and analysing change over time. Some prior experience of graph databases (e.g. Neo4j) would be beneficial.

     

     

    Handling Time Series data at scale

     

    Many problems that are trivially solved for small systems become a significant challenge when scaled up. Amongst these is handling time series data, such as samples of system metrics made for service monitoring purposes. In addition to sampling we consider counting event occurrences and analyzing logs.

     

    The tasks of sampling, counting and analysis are normally tackled with systems utilizing relational database technology.  Where it becomes technically challenging or expensive to meet the required scale or availability requirements with such solutions, nosql approaches may be employed. As part of this recent trend we see solutions such as openTSDB (sampling on HBase), rainbird/countandra (counting on Cassandra) and Hadoop (map/reduce for log analysis).  Integration with monitoring/alerting systems (e.g. ganglia, nagios, RHQ) and with complex event processing systems (e.g. esper, Drools Fusion) may also be considered.

     

    In this project you will work to create or extend one or more tools for handling time series data using nosql technologies, most likely cassandra and/or infinispan. Java based solutions will be preferred and candidates are expected to demonstrate a good grasp of Java programming. In addition, prior experience of a nosql system would be an advantage.

     

     

    Web traffic analytics in the cloud

     

    For many years Google Analytics has dominated the enterprise web traffic analysis market, by virtue of ease of use and competitive pricing. However, its capabilities are limited to largely pre-defined summary reports.  As organisations become more aware of the potential of Big Data analytics and develop in-house skills in this area, bringing the web traffic analytics function back under their own control becomes both feasible and desirable.

     

    However, there is a lack of pre-packaged open source web log analytics software for them to use as a starting point. Existing solutions e.g. piwik are feature rich but based on relational database technology that does not scale sufficiently well for larger sites.

     

    In this project you will develop a web analytics package capable of operating at scale. This may include reusing selected parts of existing solutions as well as utilising cloud services to store and process data. Relevant skills include web application and web service engineering, web UI design, nosql data storage and map/reduce data processing.