The Newcastle upon Tyne, UK office of Red Hat is predominantly a middleware engineering centre, with core developers from several middleware project teams. In addition to projects in those areas, the following topics are available to local candidates only. All work will be open source.
Notes on application procedure
Candidates should select one preferred project and, optionally, one reserve choice in which they have particular interest, so that we can ensure their application is handled by the most appropriate software engineering staff. Notification of the selected projects should be sent by email, along with a C.V. and some source code. All of these projects require competence in software engineering, so a body of work that demonstrates design, coding and testing skills is preferred. For postgraduate-level candidates, undergraduate final year project work may be most appropriate. Contributions to existing open source projects will also be considered favourably. Note that more constrained work, such as coursework from taught university modules, is unlikely to be sufficient to demonstrate the expected skill set. The provided source should be in Java, unless the selected project specifically calls for work in other languages. The submission will be reviewed by the potential project supervisors, and candidates whose work meets the required standard will be invited to a face-to-face technical interview.
Addendum for 2016/17 onwards: in addition to the topics below, candidates may submit their own proposals in the same format, i.e. a brief outline of the problem to be addressed, the reason why it's interesting and, most importantly, an explanation of why it's useful to Red Hat.
Notes on project design
In contrast to some other industrial placements, most of these project topics require some element of original research. In such cases, students will be expected to become familiar with the state of the art in the relevant field and to contribute to it, whilst producing implementation work to the standard expected of a research prototype. Such projects may be viewed as having the same form and quality standards as a PhD, with much reduced scope to fit the allocated timescale.
Some projects may alternatively focus on software engineering discipline in preference to research, requiring students to produce code built and tested to production quality. These may require delivering features into a live project release on a fixed timeline. Finally, projects may focus on the practice of community open source, working in an open, agile and collaborative manner. Soft skills and non-code contributions may form a larger part of the assessment in such cases.
And now, the project list. The topics below are offered for academic year 2018/19.
Persistent Memory for Java Middleware
Persistent Memory retains data without power (like HDD/SSD storage), but is byte addressable (like RAM). It thus presents a new programming model, with novel characteristics and benefits that are not yet fully exploited. See e.g. https://www.snia.org/PM and https://www.usenix.org/system/files/login/articles/login_summer17_07_rudoff.pdf for background. Whilst some libraries are starting to emerge to provide higher-level APIs to these low-level hardware features (see e.g. PMDK at pmem.io), the options for accessing this functionality from Java are still limited. For JavaEE middleware use cases that require fault tolerance, e.g. XA transaction logs, messaging logs, in-memory data grids and databases, PM represents a key opportunity for performance gains.
In this project you will explore ways to utilise persistent memory from Java middleware such as the Narayana transaction engine, the Apache ActiveMQ-Artemis messaging system and the Infinispan data grid. In addition to Java, some grasp of C will likely be necessary for this project. Experience of JNI, hardware architecture, Linux system programming, profiling and benchmarking may be beneficial.
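As a rough illustration of the programming model, the sketch below implements an append-only, length-prefixed log over a memory-mapped file using only the JDK. This is an approximation: on real persistent memory hardware the force() call would correspond to cache-line flush instructions, reached e.g. via a MAP_SYNC mapping or via PMDK through JNI. The class and record layout are illustrative, not part of any of the middleware projects named above.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Append-only log over a memory-mapped file. Persistent memory offers the
// same byte-addressable style of access, but with flushes that cost far less
// than a block-device sync, which is the opportunity for transaction logs.
public class MappedLog {
    private final MappedByteBuffer buf;

    public MappedLog(Path file, int capacity) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, capacity);
        }
    }

    // Write a length-prefixed record, then flush so it survives a crash.
    public void append(byte[] record) {
        buf.putInt(record.length);
        buf.put(record);
        buf.force(); // on PM hardware: CLWB/CLFLUSHOPT plus a fence
    }

    public int position() { return buf.position(); }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("txlog", ".bin");
        MappedLog log = new MappedLog(f, 4096);
        log.append("commit:tx1".getBytes(StandardCharsets.UTF_8));
        System.out.println("log position = " + log.position());
    }
}
```

A benchmark comparing force() here against an fsync-based file log would give a first feel for the gains the project aims at.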
Web protocol proxy
Proxy servers provide a way to intercept and mutate client/server communication. For web applications this is useful for performance (e.g. caching), security (e.g. blocking malicious content) and service provision (e.g. content translation, ad blocking). Extensible HTTP proxy servers for developing such capabilities in Java already exist, e.g. LittleProxy, but many modern web apps make use of additional communication features such as WebSockets or socket.io (https://socket.io/) based data streams.
In this project you will develop a Java proxy framework capable of managing WebSocket frames and/or socket.io protocol messages. This will layer on the Netty (netty.io) async I/O framework. Some prior knowledge of asynchronous I/O, network protocols and state machines would be beneficial.
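To give a flavour of the protocol handling involved, the standalone sketch below decodes the fixed part of an RFC 6455 WebSocket frame header. In the project itself this logic would live inside a Netty ChannelHandler (or simply reuse Netty's own WebSocket codecs); this version exists only to show the wire format a frame-aware proxy must understand.

```java
// Minimal decoder for the fixed part of an RFC 6455 WebSocket frame header.
public class WsHeader {
    public final boolean fin;
    public final int opcode;        // 1 = text, 2 = binary, 8 = close, ...
    public final boolean masked;    // client-to-server frames must be masked
    public final long payloadLength;
    public final int headerSize;    // bytes consumed, excluding any masking key

    private WsHeader(boolean fin, int opcode, boolean masked, long len, int size) {
        this.fin = fin; this.opcode = opcode; this.masked = masked;
        this.payloadLength = len; this.headerSize = size;
    }

    public static WsHeader parse(byte[] b) {
        boolean fin = (b[0] & 0x80) != 0;
        int opcode = b[0] & 0x0F;
        boolean masked = (b[1] & 0x80) != 0;
        long len = b[1] & 0x7F;
        int size = 2;
        if (len == 126) {            // 16-bit extended payload length follows
            len = ((b[2] & 0xFFL) << 8) | (b[3] & 0xFFL);
            size = 4;
        } else if (len == 127) {     // 64-bit extended payload length follows
            len = 0;
            for (int i = 2; i < 10; i++) len = (len << 8) | (b[i] & 0xFFL);
            size = 10;
        }
        return new WsHeader(fin, opcode, masked, len, size);
    }

    public static void main(String[] args) {
        // FIN + text frame (opcode 1), unmasked, 5-byte payload "hello"
        WsHeader h = parse(new byte[]{(byte) 0x81, 0x05, 'h', 'e', 'l', 'l', 'o'});
        System.out.println("opcode=" + h.opcode + " len=" + h.payloadLength);
    }
}
```

A mutating proxy must track this framing as a state machine per connection, which is why the project description lists state machines as relevant background.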
Browser based source code navigation
Developers typically have their most frequently accessed code indexed in an IDE (IntelliJ IDEA, Eclipse, etc.) for easy navigation. From time to time, however, they require access to infrequently used code or legacy versions that are not locally resident. For such cases a web-based code viewing and navigation solution is preferred.
The popular Subversion source control system comes with an Apache httpd module that allows browser-based viewing of the repository. However, navigation is limited to simple directory traversal and content representation is primitive, being limited to plain-text rendering of files (e.g. http://anonsvn.jboss.org/repos/labs/labs/jbosstm/). Some commercial products do marginally better. Fisheye, for example, provides version (revision) based navigation and meta-data in addition to directory traversal. However, its HTML markup of content is still rather limited and focused on change highlighting rather than code navigation (e.g. http://fisheye.jboss.org/browse/JBossTS/).
Contrast this with the rich, syntax-aware highlighting and navigation of source code provided by IDEs, where e.g. keywords are coloured, control-click on a variable navigates to its declaration point, control-click on a type navigates to its class or interface, clicking a function navigates to its source, and so on.
Consider also the large quantity of code in other repositories: maven source .jars, github public repositories, etc. Accessing this information requires tedious additional steps (search, download, unpack, load in an IDE, ...) or proprietary interfaces that create information silos between which there is no linking - you can't click through from a github source listing to a svn repository listing for e.g. a library method call. This is unsatisfactory.
In this project you will produce a web-based code repository (svn or git) navigator with a) Java source code syntax awareness and b) revision awareness. It will render the repository content with appropriate styling for e.g. keyword identification and hyperlinks for e.g. variables, functions and types. It will allow navigation between versions. To do this it may reuse existing libraries for repository integration (e.g. mvn, svnkit) and for parsing .java source files.
Additional work may include allowing for smart cross-repository linking e.g. where project A imports classes from project B, rendering of the former should include appropriate hyperlinks to classes in the repository of the latter. This navigation should also be version aware. Creating unified virtual repositories for e.g. a specific open source software product and all the 3rd party open source libraries bundled with it would also be a valuable contribution.
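The rendering side of the project can be approached incrementally. A toy starting point, sketched below, escapes a line of Java source and wraps keywords in styled spans using a regex; names and the keyword list are illustrative. A real navigator would replace the regex with a proper parser (e.g. the JavaParser library) so that identifiers can be resolved and turned into hyperlinks, which regexes cannot do.

```java
import java.util.regex.Pattern;

// Toy server-side renderer: HTML-escape a Java source line, then wrap
// keywords in spans so a browser stylesheet can colour them.
public class Highlighter {
    private static final Pattern KEYWORDS = Pattern.compile(
        "\\b(public|private|protected|class|interface|void|int|long|return|new|static|final)\\b");

    public static String render(String line) {
        String escaped = line.replace("&", "&amp;")
                             .replace("<", "&lt;")
                             .replace(">", "&gt;");
        return KEYWORDS.matcher(escaped).replaceAll("<span class=\"kw\">$1</span>");
    }

    public static void main(String[] args) {
        System.out.println(render("public class Foo {"));
        // -> <span class="kw">public</span> <span class="kw">class</span> Foo {
    }
}
```

Escaping before highlighting matters: generics like List&lt;String&gt; would otherwise inject broken markup into the page.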
The projects below have previously been tackled by other students and are not normally available to new candidates.
Graph assisted maven artifact usage analytics
The Apache Maven build and dependency management tool is widely used by Java developers and is the basis of code packaging and distribution for many open source projects, including most of those led by JBoss engineers. Analysis of maven repository server logs can therefore give some indication of community uptake of new software releases, providing valuable feedback on adoption to developers. Simple counting techniques can be useful, e.g. the number of users of each version over time can indicate how quickly users migrate to new feature/bugfix releases. However, such methods are limited as they fail to account for the dependency relationships between artifacts.
For traffic logs showing a user who requests artifact A, then artifact B soon after, where A is known to declare a dependency upon B, we may infer that the request for B was transitive via A, rather than a direct choice by the user. We may thus present a report with metrics showing how much of a project's usage is attributable to its inclusion via dependencies in various other projects. This requires building the dependency graph for all artifacts from the maven metadata and storing it in a manner that provides for efficient query during the analysis of the server traffic logs.
In this project you will build a system to facilitate reporting on project popularity by analysis of maven repository server traffic logs, focusing on utilisation of dependency graph information. Prior experience of maven, a suitable analysis platform (e.g. Apache Spark) and a graph database (e.g. Neo4j) would be beneficial.
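The inference step described above can be sketched in a few lines. In this simplified version a request for an artifact is classified as transitive if, within a short window beforehand, an artifact declaring a direct dependency on it was requested; the window size, artifact names and flat data structures are illustrative. A real system would resolve the full transitive closure from maven POM metadata, key requests by client, and hold the graph in a graph database for efficient query.

```java
import java.util.*;

// Heuristic classifier: was each request in a log a direct user choice, or
// pulled in transitively via an earlier request's declared dependencies?
public class TransitiveClassifier {
    static final long WINDOW_MS = 5_000; // illustrative proximity window

    public static List<Boolean> classify(List<String> artifacts, List<Long> times,
                                         Map<String, Set<String>> declaredDeps) {
        List<Boolean> transitive = new ArrayList<>();
        for (int i = 0; i < artifacts.size(); i++) {
            boolean t = false;
            for (int j = 0; j < i && !t; j++) {
                boolean inWindow = times.get(i) - times.get(j) <= WINDOW_MS;
                Set<String> deps = declaredDeps.getOrDefault(artifacts.get(j), Set.of());
                t = inWindow && deps.contains(artifacts.get(i));
            }
            transitive.add(t);
        }
        return transitive;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> deps = Map.of("A", Set.of("B")); // A depends on B
        List<Boolean> r = classify(List.of("A", "B", "C"),
                                   List.of(0L, 1_000L, 2_000L), deps);
        System.out.println(r); // B is inferred transitive via A; A and C are direct
    }
}
```

Subtracting the transitive counts from the raw totals yields the "attributable usage" metric the project report would present per project.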
Efficient GraphQL query execution
GraphQL (http://graphql.org) provides a potentially useful advance over traditional REST API designs, particularly with regard to reducing the number of latency-inducing remote calls a client must make to the server. However, this comes at the cost of greater complexity in implementing an efficient query plan for executing the GraphQL statements against the backend storage. Existing mechanisms, e.g. https://github.com/jcrygier/graphql-jpa, take a relatively straightforward approach to query translation, with little attention to performance costs.
In this project you will analyse and benchmark existing approaches to executing GraphQL using Hibernate, potentially encompassing both traditional relational database backends and NoSQL alternatives via Hibernate OGM. You will provide design guidelines for best-practice guidance to users, as well as prototyping new query execution strategies to address any performance issues you identify. Prior experience of GraphQL, REST, Hibernate and relational or NoSQL databases would be beneficial.
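The central performance hazard is the classic N+1 query problem. The self-contained sketch below makes it concrete by counting simulated database round-trips: resolving a nested authors-with-books query one relation at a time issues N+1 queries, while a batched strategy issues two. The "database" is just a map with a counter; real measurements in the project would of course target Hibernate against an actual database.

```java
import java.util.*;
import java.util.stream.Collectors;

// Counts simulated round-trips for two GraphQL execution strategies over a
// query like: { authors { books } }. Data and names are illustrative.
public class QueryPlanDemo {
    int queryCount = 0;
    Map<String, List<String>> booksByAuthor = Map.of(
        "a1", List.of("b1", "b2"), "a2", List.of("b3"), "a3", List.of());

    List<String> authors() { queryCount++; return List.of("a1", "a2", "a3"); }
    List<String> booksOf(String a) { queryCount++; return booksByAuthor.get(a); }
    Map<String, List<String>> booksOfAll(Collection<String> as) {
        queryCount++; // one IN-list (or join fetch) query for the whole batch
        return as.stream().collect(Collectors.toMap(a -> a, booksByAuthor::get));
    }

    int naive() {   // per-entity resolvers: 1 + N queries
        for (String a : authors()) booksOf(a);
        return queryCount;
    }

    int batched() { // DataLoader-style batching: 2 queries
        booksOfAll(authors());
        return queryCount;
    }

    public static void main(String[] args) {
        System.out.println("naive:   " + new QueryPlanDemo().naive());   // 4
        System.out.println("batched: " + new QueryPlanDemo().batched()); // 2
    }
}
```

Benchmarking in the project would measure where translation layers sit between these two extremes, and prototype strategies that push towards the batched plan.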
Graph modeling for the analysis of change in social networks
Understanding the communities of users and contributors that evolve around open source software projects is key to monitoring and managing their health. The use of a graph representation to model the social relationships is a common technique. However, it is frequently limited to a point in time snapshot of the social network's state, an approach that fails to account for the change in community over time. By adding a temporal dimension to the data modeling, it becomes possible to compare the state of the community at selected points in time.
A naive method of implementation is to construct one graph for each time period of interest, but this is inefficient as the number of periods grows and makes queries cumbersome. Adding time data to the nodes and edges of a single graph is preferred, but graph query languages frequently lack temporal primitive operators and functions to facilitate its use in analysis.
In this project you will investigate techniques for efficiently representing and querying a dataset from a large online community, e.g. the stackexchange data dump (https://archive.org/details/stackexchange), with emphasis on modeling and analysing change over time. Some prior experience of graph databases (e.g. Neo4j) would be beneficial.
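The single-graph modelling idea can be sketched directly: each edge carries a validity interval rather than belonging to a per-period snapshot, and a query filters edges with a point-in-time predicate, which is exactly the temporal primitive most graph query languages lack. Node names, timestamps and the linear scan are all illustrative; an efficient store would index the intervals.

```java
import java.util.*;

// Temporal property-graph sketch: edges are valid over [validFrom, validTo),
// so one graph answers "who was connected to whom at time t?" for any t.
public class TemporalGraph {
    record Edge(String from, String to, long validFrom, long validTo) {}

    private final List<Edge> edges = new ArrayList<>();

    public void addEdge(String from, String to, long validFrom, long validTo) {
        edges.add(new Edge(from, to, validFrom, validTo));
    }

    // Neighbours of a node as the graph existed at time t.
    public Set<String> neighboursAt(String node, long t) {
        Set<String> out = new TreeSet<>();
        for (Edge e : edges)
            if (e.from.equals(node) && e.validFrom <= t && t < e.validTo)
                out.add(e.to);
        return out;
    }

    public static void main(String[] args) {
        TemporalGraph g = new TemporalGraph();
        g.addEdge("alice", "bob", 100, 200);                // tie that later ended
        g.addEdge("alice", "carol", 150, Long.MAX_VALUE);   // still current
        System.out.println(g.neighboursAt("alice", 120));   // [bob]
        System.out.println(g.neighboursAt("alice", 250));   // [carol]
    }
}
```

Comparing neighboursAt at two timestamps is then a community-change query, with no per-period graph duplication.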
Handling Time Series data at scale
Many problems that are trivially solved for small systems become a significant challenge when scaled up. Amongst these is handling time series data, such as samples of system metrics made for service monitoring purposes. In addition to sampling, we consider counting event occurrences and analysing logs.
The tasks of sampling, counting and analysis are normally tackled with systems utilising relational database technology. Where it becomes technically challenging or expensive to meet the required scale or availability requirements with such solutions, NoSQL approaches may be employed. As part of this recent trend we see solutions such as OpenTSDB (sampling on HBase), rainbird/countandra (counting on Cassandra) and Hadoop (map/reduce for log analysis). Integration with monitoring/alerting systems (e.g. Ganglia, Nagios, RHQ) and with complex event processing systems (e.g. Esper, Drools Fusion) may also be considered.
In this project you will work to create or extend one or more tools for handling time series data using NoSQL technologies, most likely Cassandra and/or Infinispan. Java-based solutions will be preferred and candidates are expected to demonstrate a good grasp of Java programming. In addition, prior experience of a NoSQL system would be an advantage.
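The core data-modelling move behind counting systems of this kind is to increment a counter keyed by (metric, time bucket) as events arrive, so that range reads become a handful of lookups rather than scans over raw events. The in-memory sketch below shows the idea with plain maps; in the project these counters would live in Cassandra counter columns or an Infinispan cache, partitioned across nodes. Names and bucket size are illustrative.

```java
import java.util.*;

// In-memory sketch of time-bucketed event counting. Writes are O(1)
// increments; a range read touches only the buckets the range spans.
public class BucketedCounter {
    private final long bucketMillis;
    private final Map<String, Map<Long, Long>> counts = new HashMap<>();

    public BucketedCounter(long bucketMillis) { this.bucketMillis = bucketMillis; }

    public void record(String metric, long timestampMillis) {
        long bucket = timestampMillis / bucketMillis;
        counts.computeIfAbsent(metric, m -> new HashMap<>())
              .merge(bucket, 1L, Long::sum);
    }

    // Total events for a metric across all buckets touching [from, to].
    public long count(String metric, long fromMillis, long toMillis) {
        Map<Long, Long> buckets = counts.getOrDefault(metric, Map.of());
        long total = 0;
        for (long b = fromMillis / bucketMillis; b <= toMillis / bucketMillis; b++)
            total += buckets.getOrDefault(b, 0L);
        return total;
    }

    public static void main(String[] args) {
        BucketedCounter c = new BucketedCounter(60_000); // one-minute buckets
        c.record("page.hits", 5_000);
        c.record("page.hits", 10_000);
        c.record("page.hits", 70_000);
        System.out.println(c.count("page.hits", 0, 59_999)); // first minute: 2
    }
}
```

Choosing bucket granularity (and pre-aggregating coarser roll-ups) is one of the scale/precision trade-offs the project would explore.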
Web traffic analytics in the cloud
For many years Google Analytics has dominated the enterprise web traffic analysis market, by virtue of ease of use and competitive pricing. However, its capabilities are largely limited to pre-defined summary reports. As organisations become more aware of the potential of Big Data analytics and develop in-house skills in this area, bringing the web traffic analytics function back under their own control becomes both feasible and desirable.
However, there is a lack of pre-packaged open source web log analytics software for them to use as a starting point. Existing solutions, e.g. Piwik, are feature-rich but based on relational database technology that does not scale sufficiently well for larger sites.
In this project you will develop a web analytics package capable of operating at scale. This may include reusing selected parts of existing solutions, as well as utilising cloud services to store and process data. Relevant skills include web application and web service engineering, web UI design, NoSQL data storage and map/reduce data processing.
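At its simplest, the processing side is a map/reduce-style aggregation over access-log lines. The sketch below does a hits-per-page count with Java streams over a few Common Log Format lines; the log format and field positions are assumed for illustration. At scale the same grouping-and-counting shape would run as a Hadoop job or a cloud data-processing service over many log shards.

```java
import java.util.*;
import java.util.stream.Collectors;

// Minimal map/reduce-shaped aggregation: map each log line to a request
// path, then reduce by counting occurrences of each path.
public class HitCounter {
    // Extract the request path from a CLF-style line:
    //   ip - - [timestamp] "GET /path HTTP/1.1" status bytes
    static String path(String logLine) {
        String request = logLine.split("\"")[1]; // e.g. GET /path HTTP/1.1
        return request.split(" ")[1];
    }

    public static Map<String, Long> hitsPerPage(List<String> lines) {
        return lines.stream().collect(
            Collectors.groupingBy(HitCounter::path, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> log = List.of(
            "1.2.3.4 - - [01/Jan/2018:00:00:01] \"GET /index.html HTTP/1.1\" 200 512",
            "1.2.3.4 - - [01/Jan/2018:00:00:02] \"GET /about.html HTTP/1.1\" 200 256",
            "5.6.7.8 - - [01/Jan/2018:00:00:03] \"GET /index.html HTTP/1.1\" 200 512");
        System.out.println(hitsPerPage(log));
    }
}
```

Richer reports (unique visitors, referrer chains, sessions) are the same pattern with different map and reduce functions, which is what makes the map/reduce skill set listed above relevant.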