The Newcastle Upon Tyne, UK office of Red Hat is predominantly a middleware engineering centre, with core developers from several middleware project teams. In addition to projects in those areas, the following topics are available to local candidates only. All work will be open source.
Notes on application procedure
Candidates should select one preferred project and optionally one reserve choice in which they have particular interest, in order that we can ensure their application is handled by the most appropriate software engineering staff. Notification of the selected projects should be sent by email, along with a C.V. and some source code. All these projects require competence in software engineering, so a body of work that demonstrates design, coding and testing skills is preferred. For postgraduate level candidates, the undergraduate final year project work may be most appropriate. Contributions to existing open source projects will also be considered favorably. Note that more constrained work, such as coursework from taught university modules, is unlikely to be sufficient to demonstrate the expected skill set. The provided source should be in Java, unless the selected project specifically calls for work in other languages. The submission will be reviewed by the potential project supervisors, and candidates whose work meets the required standard will be invited to a face-to-face technical interview.
Addendum for 2016/17 onwards: in addition to topics below, candidates may submit their own proposals in the same format i.e. a brief outline of the problem to be addressed, the reason why it's interesting and, most importantly, an explanation of why it's useful to Red Hat.
Notes on project design
In contrast to some other industrial placements, most of these project topics require some element of original research. In such cases, students will be expected to become familiar with the state of the art in the relevant field and to contribute to it, whilst producing implementation work to the standard expected of a research prototype. Such projects may be viewed as having the same form and quality standards as a PhD, with much reduced scope to fit the allocated timescale.
Some projects may alternatively focus on software engineering discipline in preference to research, requiring students to produce code built and tested to production quality. These may require delivering features into a live project release on a fixed timeline. Finally, projects may focus on the practice of community open source, working in an open, agile and collaborative manner. Soft skills and non-code contributions may form a larger part of the assessment in such cases.
And now, the project list. These topics are offered for academic year 2019/20.
Persistent Memory for Java Middleware
Persistent Memory retains data without power (like HDD/SSD storage), but is byte addressable (like RAM). It thus presents a new programming model, with novel characteristics and benefits that are not yet fully exploited. See e.g. https://www.snia.org/PM and https://www.usenix.org/system/files/login/articles/login_summer17_07_rudoff.pdf for background. Whilst some libraries are starting to emerge to provide higher level APIs to these low-level hardware features, see e.g. PMDK (pmem.io), the options for accessing this functionality from Java are still limited. For Java EE middleware use cases that require fault tolerance, e.g. XA transaction logs, messaging logs, in-memory data grids and databases, PM represents a key opportunity for performance gains.
In this project you will explore ways to utilise persistent memory from Java middleware such as the Narayana transaction engine, Apache ActiveMQ-Artemis messaging system and Infinispan data grid. In addition to Java, some grasp of C will likely be necessary for this project. Experience of JNI, hardware architecture, Linux system programming, profiling and benchmarking may be beneficial.
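As a rough illustration of the byte-addressable programming model, the sketch below keeps a small append-only log in a memory-mapped region using only the standard java.nio API. It is an assumption-laden stand-in rather than a persistent memory implementation: the class name MappedLog, the record layout and the two-step force() ordering are all hypothetical, and an ordinary file mapping is used in place of real PM hardware. On JDK 14 and later, JEP 352 adds a map mode intended for PM-backed files, which is not used here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical append-only log in a memory-mapped region (not a Narayana or PMDK API). */
final class MappedLog implements AutoCloseable {
    private static final int HEADER = Integer.BYTES; // bytes 0..3 hold the committed length
    private final FileChannel channel;
    private final MappedByteBuffer buffer;

    MappedLog(Path file, int capacity) {
        try {
            channel = FileChannel.open(file, StandardOpenOption.CREATE,
                    StandardOpenOption.READ, StandardOpenOption.WRITE);
            // Mapping beyond the current end of the file extends it with zero bytes.
            buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, HEADER + capacity);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Appends one length-prefixed record and flushes it before returning. */
    void append(String record) {
        byte[] bytes = record.getBytes(StandardCharsets.UTF_8);
        int committed = buffer.getInt(0);
        int start = HEADER + committed;
        if (start + Integer.BYTES + bytes.length > buffer.capacity()) {
            throw new IllegalStateException("log full");
        }
        buffer.putInt(start, bytes.length);
        for (int i = 0; i < bytes.length; i++) {
            buffer.put(start + Integer.BYTES + i, bytes[i]);
        }
        buffer.force(); // flush the record body first
        buffer.putInt(0, committed + Integer.BYTES + bytes.length);
        buffer.force(); // then publish it by advancing the committed length
    }

    /** Reads back every record up to the committed length. */
    List<String> readAll() {
        List<String> records = new ArrayList<>();
        int committed = buffer.getInt(0);
        int offset = HEADER;
        while (offset < HEADER + committed) {
            int length = buffer.getInt(offset);
            byte[] bytes = new byte[length];
            for (int i = 0; i < length; i++) {
                bytes[i] = buffer.get(offset + Integer.BYTES + i);
            }
            records.add(new String(bytes, StandardCharsets.UTF_8));
            offset += Integer.BYTES + length;
        }
        return records;
    }

    @Override
    public void close() {
        try {
            channel.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Writing the record body before advancing the committed length means a reader that reopens the log after a crash only sees fully written records, which is the property a transaction log needs.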
Graph Analytics for mavenized Java libraries
Java software components are frequently combined by maven dependency management tooling, using project metadata expressed in the form of .pom files. Understanding these dependency relationships is key to e.g. determining the potential impact of bugs or security vulnerabilities, the spread of open source licenses, the adoption lifecycle of releases and suchlike.
In this project you will develop tooling and techniques for the analysis of a large corpus of maven metadata (~4 million .pom files) using a predominantly graph-oriented approach. The use of graph processing frameworks (e.g. GraphX, Gelly, Giraph) and/or graph databases (e.g. Neo4j) will likely be central to the project. Some understanding of maven internals may also be useful.
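To make the impact analysis idea concrete, here is a minimal in-memory sketch, assuming the dependency edges have already been extracted from the .pom files. The class DependencyImpact and its methods are hypothetical names for illustration. At the scale of the real corpus a graph framework or database would take the place of the HashMap, but the traversal is the same: walk the reversed dependency edges outward from the artifact with the bug.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: finds every artifact that depends, directly or transitively, on another. */
final class DependencyImpact {
    // Reversed edges: dependency -> artifacts that declare it.
    private final Map<String, Set<String>> dependents = new HashMap<>();

    /** Records that 'artifact' declares a dependency on 'dependsOn'. */
    void addDependency(String artifact, String dependsOn) {
        dependents.computeIfAbsent(dependsOn, k -> new HashSet<>()).add(artifact);
    }

    /** Breadth-first walk over reversed edges; the result set doubles as the visited set. */
    Set<String> affectedBy(String artifact) {
        Set<String> affected = new TreeSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(artifact);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            for (String dependent : dependents.getOrDefault(current, Collections.emptySet())) {
                if (affected.add(dependent)) {
                    queue.add(dependent);
                }
            }
        }
        affected.remove(artifact); // only relevant if the metadata contains a cycle
        return affected;
    }
}
```

For example, if app depends on lib and lib depends on core, a vulnerability in core is reported as affecting both lib and app.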
Neural Machine Translation for Java bytecode decompilation
Decompilers translate a low level language representation of a program (bytecode, machine code) to a high level representation (source code), facilitating analysis of unknown binaries for security or bug fixing. Decompilers are typically hand-crafted using pattern matching techniques.
Recently the machine learning community has developed natural language translation techniques based on neural networks, which show considerable success. The possibility exists to apply this approach to the problem of (de)compilation, treating bytecode and source code as languages to be translated between.
Research in this field is at an early stage, with some limited success for the approach in an academic context, but little focus on Java. This project thus represents an opportunity to break new ground in an important research field.
Using a corpus of open source Java code and binaries, this project will apply the latest NMT decompilation research techniques to evaluate their suitability for use with Java and propose appropriate changes or enhancements. Some experience in machine learning, bytecode manipulation or JVM architecture would be beneficial.
The projects below, having previously been tackled by other students, are included for information only and are not normally available to new candidates.
Exceptions may be available for students who can demonstrate a research plan for materially extending the prior work in novel ways.
Browser based source code navigation
Developers occasionally wish to review and navigate source code that is not locally resident i.e. not imported into their desktop IDE. For single file cases this is sometimes possible, e.g. by browsing github. However, sometimes it is not e.g. maven repositories have source jars, but only support downloading the entire jar, not looking inside it. Multi-file cases are even more problematic as, although many rendering systems will provide syntax highlighting, few allow navigation between files by e.g. clicking on type or method declarations. Whilst exceptions do exist (see e.g. zgrepcode.com) none currently allows the linking between files to be dynamically configured by the user. For users accustomed to the rich navigation support offered by IDEs, this is not satisfactory.
In this project you will produce a web based Java code repository navigator which addresses these issues, particularly by introducing the novel feature of allowing a user provided classpath to influence the navigation i.e. support dynamic symbol resolution. This is a multi-faceted project encompassing on the one hand Java code parsing, syntax trees, symbol tables, type resolution algorithms and suchlike, and on the other hand, web application architecture, deployment and operations at scale, ideally to a cloud based environment, in order to put the developed service into production. Due to time limitations it may be necessary to emphasize one aspect over the other, but at least some attention to both will be required.
Graph assisted maven artifact usage analytics
The Apache Maven build and dependency management tool is widely used by Java developers and is the basis of code packaging and distribution for many open source projects, including most of those led by JBoss engineers. Analysis of maven repository server logs can therefore give some indication of community uptake of new software releases, providing valuable feedback on adoption to developers. Simple counting techniques can be useful, e.g. number of users of each version over time can indicate how quickly users migrate to new feature/bugfix releases. However, such methods are limited as they fail to account for the dependency relationships between artifacts.
For traffic logs showing a user who requests artifact A, then artifact B soon after, where A is known to declare a dependency upon B, we may infer that the request for B was transitive via A, rather than a direct choice by the user. We may thus present a report with metrics showing how much of a project's usage is attributable to its inclusion via dependencies in various other projects. This requires building the dependency graph for all artifacts from the maven metadata and storing it in a manner that provides for efficient query during the analysis of the server traffic logs.
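The inference described above can be sketched as follows. This is a simplified illustration under stated assumptions: the class RequestAttribution, the Request fields, the fixed time window and the "(direct)" label are all hypothetical, the dependency graph is an in-memory map rather than a graph database, and the log is assumed to be sorted by time. Only directly declared dependencies are checked. A chain A, B, C still resolves step by step, because the request for C is attributed to the preceding request for B.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch: attributes each request to a direct choice or to a recent dependent. */
final class RequestAttribution {
    static final String DIRECT = "(direct)";

    static final class Request {
        final String user;
        final String artifact;
        final long epochSeconds;

        Request(String user, String artifact, long epochSeconds) {
            this.user = user;
            this.artifact = artifact;
            this.epochSeconds = epochSeconds;
        }
    }

    private final Map<String, Set<String>> declaredDependencies; // artifact -> its dependencies
    private final long windowSeconds;                            // meaning of "soon after"

    RequestAttribution(Map<String, Set<String>> declaredDependencies, long windowSeconds) {
        this.declaredDependencies = declaredDependencies;
        this.windowSeconds = windowSeconds;
    }

    /** Returns, per requested artifact, a count of requests per attributed source. */
    Map<String, Map<String, Integer>> attribute(List<Request> requestsInTimeOrder) {
        Map<String, Deque<Request>> recentByUser = new HashMap<>();
        Map<String, Map<String, Integer>> report = new TreeMap<>();
        for (Request request : requestsInTimeOrder) {
            Deque<Request> recent =
                    recentByUser.computeIfAbsent(request.user, u -> new ArrayDeque<>());
            // Drop this user's requests that fall outside the time window.
            while (!recent.isEmpty()
                    && recent.peekFirst().epochSeconds < request.epochSeconds - windowSeconds) {
                recent.removeFirst();
            }
            // Look for the most recent earlier request that declares this artifact.
            String via = DIRECT;
            Iterator<Request> newestFirst = recent.descendingIterator();
            while (newestFirst.hasNext()) {
                Request earlier = newestFirst.next();
                Set<String> dependencies = declaredDependencies.get(earlier.artifact);
                if (dependencies != null && dependencies.contains(request.artifact)) {
                    via = earlier.artifact;
                    break;
                }
            }
            report.computeIfAbsent(request.artifact, a -> new TreeMap<>())
                    .merge(via, 1, Integer::sum);
            recent.addLast(request);
        }
        return report;
    }
}
```

The window length is a tuning choice: too short and slow builds are misread as direct requests, too long and unrelated requests are misread as transitive.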
In this project you will build a system to facilitate reporting on project popularity by analysis of maven repository server traffic logs, focusing on utilisation of dependency graph information. Prior experience of maven, a suitable analysis platform (e.g. Apache Spark) and a graph database (e.g. Neo4j) would be beneficial.
GraphQL (http://graphql.org) provides a potentially useful advance over traditional REST API designs, particularly with regard to reducing the number of latency-inducing remote calls a client must make to the server. However, this comes at the cost of greater complexity in implementing an efficient query plan for executing the GraphQL statements against the backend storage. Existing mechanisms, e.g. https://github.com/jcrygier/graphql-jpa, take a relatively straightforward approach to query translation, with little attention to performance costs.
In this project you will analyse and benchmark existing approaches to executing GraphQL using Hibernate, potentially encompassing both traditional relational database backends and NoSQL database alternatives via Hibernate OGM. You will provide best practice design guidelines for users, as well as prototyping new query execution strategies to address any performance issues you identify. Prior experience of GraphQL, REST, Hibernate and relational or NoSQL databases would be beneficial.
Graph modeling for the analysis of change in social networks
Understanding the communities of users and contributors that evolve around open source software projects is key to monitoring and managing their health. The use of a graph representation to model the social relationships is a common technique. However, it is frequently limited to a point in time snapshot of the social network's state, an approach that fails to account for the change in community over time. By adding a temporal dimension to the data modeling, it becomes possible to compare the state of the community at selected points in time.
A naive method of implementation is to construct one graph for each time period of interest, but this is inefficient as the number of periods grows and makes queries cumbersome. Adding time data to the nodes and edges of a single graph is preferred, but graph query languages frequently lack temporal primitive operators and functions to facilitate its use in analysis.
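The single-graph approach can be sketched as below, assuming each edge carries a validity interval. The class TemporalGraph, the half-open interval convention and the string form of an edge are hypothetical choices for illustration. The linear scan stands in for what would be an indexed query in a graph database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: one graph whose edges carry a [validFrom, validTo) interval. */
final class TemporalGraph {
    static final long OPEN = Long.MAX_VALUE; // marks an edge that is still current

    private static final class Edge {
        final String from;
        final String to;
        final long validFrom;
        final long validTo;

        Edge(String from, String to, long validFrom, long validTo) {
            this.from = from;
            this.to = to;
            this.validFrom = validFrom;
            this.validTo = validTo;
        }
    }

    private final List<Edge> edges = new ArrayList<>();

    void addEdge(String from, String to, long validFrom, long validTo) {
        edges.add(new Edge(from, to, validFrom, validTo));
    }

    /** Point-in-time snapshot: every edge whose interval contains 'time'. */
    Set<String> edgesAt(long time) {
        Set<String> active = new TreeSet<>();
        for (Edge edge : edges) {
            if (edge.validFrom <= time && time < edge.validTo) {
                active.add(edge.from + "->" + edge.to);
            }
        }
        return active;
    }

    /** Relationships present at 'later' that were absent at 'earlier'. */
    Set<String> addedBetween(long earlier, long later) {
        Set<String> added = edgesAt(later);
        added.removeAll(edgesAt(earlier));
        return added;
    }

    /** Relationships present at 'earlier' that are gone by 'later'. */
    Set<String> removedBetween(long earlier, long later) {
        Set<String> removed = edgesAt(earlier);
        removed.removeAll(edgesAt(later));
        return removed;
    }
}
```

Comparing two points in time then becomes a pair of set differences over the same graph, rather than a join across separately stored snapshots.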
In this project you will investigate techniques for efficiently representing and querying a dataset from a large online community, e.g. the stackexchange data dump (https://archive.org/details/stackexchange), with emphasis on modeling and analysing change over time. Some prior experience of graph databases (e.g. Neo4j) would be beneficial.
Handling Time Series data at scale
Many problems that are trivially solved for small systems become a significant challenge when scaled up. Amongst these is handling time series data, such as samples of system metrics made for service monitoring purposes. In addition to sampling we consider counting event occurrences and analyzing logs.
The tasks of sampling, counting and analysis are normally tackled with systems utilizing relational database technology. Where it becomes technically challenging or expensive to meet the required scale or availability requirements with such solutions, nosql approaches may be employed. As part of this recent trend we see solutions such as openTSDB (sampling on HBase), rainbird/countandra (counting on Cassandra) and Hadoop (map/reduce for log analysis). Integration with monitoring/alerting systems (e.g. ganglia, nagios, RHQ) and with complex event processing systems (e.g. esper, Drools Fusion) may also be considered.
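As a small illustration of the counting task, the sketch below groups event occurrences into fixed-width time buckets, one common way to lay out counters in a key-value store. It is an assumption for illustration only and not how any of the systems named above are implemented: the class BucketedCounter and its key format are hypothetical, and a ConcurrentHashMap stands in for the distributed store.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical sketch: counts events per fixed-width time bucket. */
final class BucketedCounter {
    private final long bucketSeconds;
    // Key is "event@bucketStart"; in a NoSQL store this would be the row or column key.
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    BucketedCounter(long bucketSeconds) {
        this.bucketSeconds = bucketSeconds;
    }

    private String key(String event, long epochSeconds) {
        long bucketStart = Math.floorDiv(epochSeconds, bucketSeconds) * bucketSeconds;
        return event + "@" + bucketStart;
    }

    /** Records one occurrence of 'event' at the given time. */
    void record(String event, long epochSeconds) {
        counts.computeIfAbsent(key(event, epochSeconds), k -> new LongAdder()).increment();
    }

    /** Returns the count for the bucket containing the given time. */
    long count(String event, long epochSeconds) {
        LongAdder adder = counts.get(key(event, epochSeconds));
        return adder == null ? 0 : adder.sum();
    }
}
```

The bucket width trades resolution against storage: one-minute buckets give finer graphs than one-hour buckets at sixty times the number of keys.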
In this project you will work to create or extend one or more tools for handling time series data using nosql technologies, most likely Cassandra and/or Infinispan. Java based solutions will be preferred and candidates are expected to demonstrate a good grasp of Java programming. In addition, prior experience of a nosql system would be an advantage.
Web traffic analytics in the cloud
For many years Google Analytics has dominated the enterprise web traffic analysis market, by virtue of ease of use and competitive pricing. However, its capabilities are limited to largely pre-defined summary reports. As organisations become more aware of the potential of Big Data analytics and develop in-house skills in this area, bringing the web traffic analytics function back under their own control becomes both feasible and desirable.
However, there is a lack of pre-packaged open source web log analytics software for them to use as a starting point. Existing solutions, e.g. piwik, are feature rich but based on relational database technology that does not scale sufficiently well for larger sites.
In this project you will develop a web analytics package capable of operating at scale. This may include reusing selected parts of existing solutions as well as utilising cloud services to store and process data. Relevant skills include web application and web service engineering, web UI design, nosql data storage and map/reduce data processing.