Version 31

    The Newcastle upon Tyne, UK office of Red Hat is predominantly a middleware engineering centre, with core developers from several middleware project teams. In addition to projects in those areas, the following topics are available to local candidates only. All work will be open source.

     

    Notes on application procedure

     

    Candidates should select one preferred project and optionally one reserve choice in which they have particular interest, so that we can ensure their application is handled by the most appropriate software engineering staff. Notification of the selected projects should be sent by email, along with a C.V. and some source code. All these projects require competence in software engineering, so a body of work that demonstrates design, coding and testing skills is preferred. For postgraduate level candidates, the undergraduate final year project work may be most appropriate. Contributions to existing open source projects will also be considered favourably. Note that more constrained work, such as coursework from taught university modules, is unlikely to be sufficient to demonstrate the expected skill set. The provided source should be in Java, unless the selected project specifically calls for work in other languages. The submission will be reviewed by the potential project supervisors, and candidates whose work meets the required standard will be invited to a face-to-face technical interview.

     

    Addendum for 2016/17 onwards: in addition to the topics below, candidates may submit their own proposals in the same format, i.e. a brief outline of the problem to be addressed, the reason why it is interesting and, most importantly, an explanation of why it would be useful to Red Hat.

     

    Notes on project design

     

    In contrast to some other industrial placements, most of these project topics require some element of original research. In such cases, students will be expected to become familiar with the state of the art in the relevant field and to contribute to it, whilst producing implementation work to the standard expected of a research prototype. Such projects may be viewed as having the same form and quality standards as a PhD, with much reduced scope to fit the allocated timescale.

     

    Some projects may alternatively focus on software engineering discipline in preference to research, requiring students to produce code built and tested to production quality. These may require delivering features into a live project release on a fixed timeline. Finally, projects may focus on the practice of community open source, working in an open, agile and collaborative manner. Soft skills and non-code contributions may form a larger part of the assessment in such cases.

     

    And now, the project list. The topics below are offered for academic year 2018/19.

     

    Persistent Memory for Java Middleware

     

    Persistent Memory retains data without power (like HDD/SSD storage), but is byte addressable (like RAM). It thus presents a new programming model, with novel characteristics and benefits that are not yet fully exploited. See e.g. https://www.snia.org/PM and https://www.usenix.org/system/files/login/articles/login_summer17_07_rudoff.pdf for background. Whilst some libraries are starting to emerge that provide higher level APIs over these low-level hardware features (see e.g. PMDK at pmem.io), the options for accessing this functionality from Java are still limited. For JavaEE middleware use cases that require fault tolerance, e.g. XA transaction logs, messaging logs, in-memory data grids and databases, PM represents a key opportunity for performance gains.

     

    In this project you will explore ways to utilise persistent memory from Java middleware such as the Narayana transaction engine, the Apache ActiveMQ Artemis messaging system and the Infinispan data grid. In addition to Java, some grasp of C will likely be necessary for this project. Experience of JNI, hardware architecture, Linux system programming, profiling and benchmarking may be beneficial.
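    Absent real PM hardware, the programming model can be approximated in plain Java: on Linux, persistent memory is typically exposed as a DAX-mountable file, and MappedByteBuffer.force() stands in for the finer-grained flush primitives PMDK provides. The sketch below is a toy append-only transaction log in that style; the class name and file layout are illustrative, not taken from any of the projects above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: a tiny append-only log over a memory-mapped file. On a
// DAX-mounted PM device the same write path bypasses the page cache;
// force() plays the role of PMDK's flush operations.
public class PmLogSketch {
    private final MappedByteBuffer buf;

    private PmLogSketch(MappedByteBuffer buf) {
        this.buf = buf;
        buf.position(Integer.BYTES); // slot 0 holds the committed length
    }

    /** Opens a log backed by a fresh temp file (stand-in for a PM device). */
    public static PmLogSketch open(String prefix, int capacity) {
        try {
            Path file = Files.createTempFile(prefix, ".bin");
            try (FileChannel ch = FileChannel.open(file,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                return new PmLogSketch(
                        ch.map(FileChannel.MapMode.READ_WRITE, 0, capacity));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public void append(byte[] record) {
        buf.put(record);
        buf.force();                   // flush the payload first...
        buf.putInt(0, buf.position()); // ...then commit by publishing length
        buf.force();
    }

    public int committedLength() {
        return buf.getInt(0);
    }

    public static void main(String[] args) {
        PmLogSketch log = open("pmlog", 4096);
        log.append("hello".getBytes());
        System.out.println(log.committedLength()); // 9: 4-byte header + 5 bytes
    }
}
```

    The flush-then-publish ordering is the essence of the recovery protocols these middleware logs need: a crash between the two force() calls leaves the previously committed length intact.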

     

    Browser based source code navigation

     

    Developers occasionally wish to review and navigate source code that is not locally resident, i.e. not imported into their desktop IDE. For single-file cases this is sometimes possible, e.g. by browsing GitHub. However, sometimes it is not: Maven repositories, for example, host source jars but only support downloading the entire jar, not looking inside it. Multi-file cases are even more problematic as, although many rendering systems provide syntax highlighting, few allow navigation between files by e.g. clicking on type or method declarations. Whilst exceptions do exist (see e.g. zgrepcode.com), none currently allows the linking between files to be dynamically configured by the user. For users accustomed to the rich navigation support offered by IDEs, this is not satisfactory.

     

    In this project you will produce a web based Java code repository navigator which addresses these issues, particularly by introducing the novel feature of allowing a user provided classpath to influence the navigation i.e. support dynamic symbol resolution. This is a multi-faceted project encompassing on the one hand Java code parsing, syntax trees, symbol tables, type resolution algorithms and suchlike, and on the other hand, web application architecture, deployment and operations at scale, ideally to a cloud based environment, in order to put the developed service into production. Due to time limitations it may be necessary to emphasize one aspect over the other, but at least some attention to both will be required.
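    The dynamic resolution idea can be sketched with a toy model in which the user-provided "classpath" is just an ordered list of packages and the type names they contain, and resolution takes the first match, as a real JVM classpath would. All names below are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Sketch: resolving a simple type name against a user-supplied, ordered
// classpath, so the same rendered source links to different declarations
// depending on the user's configuration.
public class SymbolResolverSketch {
    // package -> simple type names it contains (stand-in for jar scanning)
    private final Map<String, List<String>> classpath = new LinkedHashMap<>();

    public void addEntry(String pkg, List<String> types) {
        classpath.put(pkg, types);
    }

    // First match wins, mirroring JVM classpath semantics.
    public Optional<String> resolve(String simpleName) {
        for (Map.Entry<String, List<String>> e : classpath.entrySet()) {
            if (e.getValue().contains(simpleName)) {
                return Optional.of(e.getKey() + "." + simpleName);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        SymbolResolverSketch r = new SymbolResolverSketch();
        r.addEntry("com.example.collections", List.of("ArrayList"));
        r.addEntry("java.util", List.of("ArrayList", "Map"));
        // The user's ordering decides which declaration a click targets:
        System.out.println(r.resolve("ArrayList").get());
        // -> com.example.collections.ArrayList
    }
}
```

    A production resolver would of course work from parsed syntax trees and real jar indexes rather than hand-entered lists, but the first-match-wins ordering is the part the user-supplied classpath controls.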

     

    Some knowledge of code parsing methods, syntax trees and symbol tables would be beneficial, as would understanding of web UI technologies (html/css/javascript), the language server protocol, maven, and cloud computing. Whilst some components will likely require programming in Java, the option exists to explore use of other languages for some functionality, e.g. golang or javascript for a symbol resolver deployed to a serverless environment.

     

    Machine Learning for Code Completion

     

    IDEs provide code completion suggestions through analysis of the code structure, for example to suggest available variable/method/type names. These suggestions, whilst accurate, are not necessarily well ordered. A handful of simple heuristics are used to decide suggestion ordering, and where these are not optimal the user must apply extra effort to cycle through the available options.
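    For illustration, one heuristic of the kind IDEs use can be sketched as a selection-frequency ranker: candidates the user picked before sort first, ties break alphabetically. A learned model would replace this scoring function with one trained on context features. Names are illustrative.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: order completion candidates by past selection count (descending),
// falling back to alphabetical order for unseen symbols.
public class CompletionRankerSketch {
    private final Map<String, Integer> selectionCounts = new HashMap<>();

    public void recordSelection(String symbol) {
        selectionCounts.merge(symbol, 1, Integer::sum);
    }

    public List<String> rank(List<String> candidates) {
        List<String> out = new ArrayList<>(candidates);
        out.sort(Comparator
                .comparingInt((String s) -> -selectionCounts.getOrDefault(s, 0))
                .thenComparing(Comparator.naturalOrder()));
        return out;
    }

    public static void main(String[] args) {
        CompletionRankerSketch ranker = new CompletionRankerSketch();
        ranker.recordSelection("toString");
        ranker.recordSelection("toString");
        ranker.recordSelection("size");
        System.out.println(ranker.rank(List.of("size", "toArray", "toString")));
        // [toString, size, toArray]
    }
}
```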

     

    Recent research efforts seek to apply machine learning to generating code (see e.g. "Program that repairs programs: how to achieve 78.3 percent precision in automated program repair" from Microsoft Research, and https://medium.com/@martin.monperrus/human-competitive-patches-in-automatic-program-repair-with-repairnator-359042e00f6a) and to tuning software behaviour (https://arxiv.org/pdf/1712.01208.pdf, https://www.cs.cmu.edu/~ggordon/van-aken-etal-parameters.pdf, http://cidrdb.org/cidr2019/papers/p117-kraska-cidr19.pdf), but little attention has been given to the problem of ordering code completion suggestions.

     

    In this project you will develop a system for training and evaluating an ML model for ordering Java code completion suggestions, and compare the approach against existing heuristic techniques. A good knowledge of Java will be required, along with some understanding of machine learning techniques.

     

     

    Archived Projects

     

    The projects below have previously been tackled by other students and are not normally available to new candidates.

     

     

    Graph assisted maven artifact usage analytics

     

    The Apache Maven build and dependency management tool is widely used by Java developers and is the basis of code packaging and distribution for many open source projects, including most of those led by JBoss engineers. Analysis of maven repository server logs can therefore give some indication of community uptake of new software releases, providing valuable feedback on adoption to developers. Simple counting techniques can be useful, e.g. the number of users of each version over time can indicate how quickly users migrate to new feature/bugfix releases. However, such methods are limited as they fail to account for the dependency relationships between artifacts.

     

    For traffic logs showing a user who requests artifact A, then artifact B soon after, where A is known to declare a dependency upon B, we may infer that the request for B was transitive via A, rather than a direct choice by the user. We may thus present a report with metrics showing how much of a project's usage is attributable to its inclusion via dependencies in various other projects. This requires building the dependency graph for all artifacts from the maven metadata and storing it in a manner that provides for efficient query during the analysis of the server traffic logs.
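    The inference above can be sketched as a stateful classifier over the log stream: a request is flagged as transitive when the same user fetched a declaring artifact within a short window beforehand. The five-second window, coordinates and class name are illustrative.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Sketch: classify each maven repository request as direct or transitive,
// given the declared dependency graph and per-user request history.
public class TransitiveUsageSketch {
    private static final long WINDOW_MS = 5_000; // illustrative threshold

    private final Map<String, Set<String>> deps;          // artifact -> declared deps
    private final Map<String, String> lastArtifactByUser = new HashMap<>();
    private final Map<String, Long> lastTimeByUser = new HashMap<>();

    public TransitiveUsageSketch(Map<String, Set<String>> deps) {
        this.deps = deps;
    }

    /** Returns true if this request looks transitive rather than direct. */
    public boolean recordRequest(String user, String artifact, long timestampMs) {
        String prev = lastArtifactByUser.get(user);
        Long prevTime = lastTimeByUser.get(user);
        boolean transitive = prev != null && prevTime != null
                && timestampMs - prevTime <= WINDOW_MS
                && deps.getOrDefault(prev, Set.of()).contains(artifact);
        lastArtifactByUser.put(user, artifact);
        lastTimeByUser.put(user, timestampMs);
        return transitive;
    }

    public static void main(String[] args) {
        TransitiveUsageSketch t = new TransitiveUsageSketch(
                Map.of("org.example:A", Set.of("org.example:B")));
        System.out.println(t.recordRequest("u1", "org.example:A", 1_000)); // false: direct
        System.out.println(t.recordRequest("u1", "org.example:B", 2_000)); // true: via A
    }
}
```

    At repository scale the in-memory maps would be replaced by the graph database and analysis platform the project brief suggests, but the per-request decision logic is the same.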

     

    In this project you will build a system to facilitate reporting on project popularity by analysis of maven repository server traffic logs, focusing on utilisation of dependency graph information. Prior experience of maven, a suitable analysis platform (e.g.  Apache Spark) and a graph database (e.g. Neo4j) would be beneficial.

     

    GraphQL to Hibernate query adapter

     

    http://graphql.org provides a potentially useful advance over traditional REST API designs, particularly with regard to reducing the number of latency-inducing remote calls a client must make to the server. However, this comes at the cost of greater complexity in implementing an efficient query plan for executing the GraphQL statements against the backend storage. Existing mechanisms e.g. https://github.com/jcrygier/graphql-jpa take a relatively straightforward approach to query translation, with little attention to performance costs.
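    To make the cost gap concrete, here is a toy illustration of the "N+1" pattern a field-by-field translation tends to produce, versus a batched plan. Query counting stands in for real SQL, and all names are hypothetical rather than drawn from any of the libraries mentioned.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch: a naive GraphQL-to-storage translation issues one backend query
// per parent object; a planned translation batches them into one IN query.
public class QueryPlanSketch {
    static int queriesIssued = 0;

    // stand-in for "select ... from comments where post_id = ?"
    static List<String> commentsForPost(int postId) {
        queriesIssued++;
        return List.of("comment-" + postId);
    }

    // stand-in for "select ... from comments where post_id in (...)"
    static Map<Integer, List<String>> commentsForPosts(List<Integer> postIds) {
        queriesIssued++;
        return postIds.stream().collect(
                Collectors.toMap(id -> id, id -> List.of("comment-" + id)));
    }

    public static void main(String[] args) {
        List<Integer> posts = List.of(1, 2, 3);

        queriesIssued = 0;
        posts.forEach(QueryPlanSketch::commentsForPost);
        int naive = queriesIssued;   // one query per post

        queriesIssued = 0;
        commentsForPosts(posts);
        int batched = queriesIssued; // a single IN query

        System.out.println(naive + " vs " + batched); // 3 vs 1
    }
}
```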

     

    In this project you will analyse and benchmark existing approaches to executing GraphQL using Hibernate, potentially encompassing both traditional relational database backends and NoSQL alternatives via Hibernate OGM. You will provide best practice guidance for users, as well as prototyping new query execution strategies to address any performance issues you identify. Prior experience of GraphQL, REST, Hibernate and relational or NoSQL databases would be beneficial.

     

     

    Graph modeling for the analysis of change in social networks

     

    Understanding the communities of users and contributors that evolve around open source software projects is key to monitoring and managing their health. The use of a graph representation to model the social relationships is a common technique. However, it is frequently limited to a point in time snapshot of the social network's state, an approach that fails to account for the change in community over time. By adding a temporal dimension to the data modeling, it becomes possible to compare the state of the community at selected points in time.

     

    A naive method of implementation is to construct one graph for each time period of interest, but this is inefficient as the number of periods grows and makes queries cumbersome. Adding time data to the nodes and edges of a single graph is preferred, but graph query languages frequently lack temporal primitive operators and functions to facilitate its use in analysis.
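    The preferred single-graph approach can be sketched in plain Java: each edge carries a validity interval, and a "neighbours at time t" query filters on it, which is exactly the temporal primitive many graph query languages lack. Names and the use of years as timestamps are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: a single graph whose edges are annotated with [validFrom, validTo)
// intervals, supporting point-in-time queries without per-period snapshots.
public class TemporalGraphSketch {
    static final class Edge {
        final String from, to;
        final long validFrom, validTo;
        Edge(String from, String to, long validFrom, long validTo) {
            this.from = from; this.to = to;
            this.validFrom = validFrom; this.validTo = validTo;
        }
    }

    private final List<Edge> edges = new ArrayList<>();

    public void addEdge(String from, String to, long validFrom, long validTo) {
        edges.add(new Edge(from, to, validFrom, validTo));
    }

    /** Neighbours of a node as the graph stood at time t. */
    public List<String> neighboursAt(String node, long t) {
        List<String> out = new ArrayList<>();
        for (Edge e : edges) {
            if (e.from.equals(node) && e.validFrom <= t && t < e.validTo) {
                out.add(e.to);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        TemporalGraphSketch g = new TemporalGraphSketch();
        g.addEdge("alice", "bob", 2010, 2015);   // years as timestamps
        g.addEdge("alice", "carol", 2013, 2020);
        System.out.println(g.neighboursAt("alice", 2012)); // [bob]
        System.out.println(g.neighboursAt("alice", 2014)); // [bob, carol]
    }
}
```

    In a graph database the interval check would be pushed into the query (e.g. as edge-property predicates), with interval indexing replacing this linear scan.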

     

    In this project you will investigate techniques for efficiently representing and querying a dataset from a large online community, e.g. the stackexchange data dump (https://archive.org/details/stackexchange), with emphasis on modeling and analysing change over time. Some prior experience of graph databases (e.g. Neo4j) would be beneficial.

     

     

    Handling Time Series data at scale

     

    Many problems that are trivially solved for small systems become a significant challenge when scaled up. Amongst these is the handling of time series data, such as samples of system metrics taken for service monitoring purposes. In addition to sampling, we consider counting event occurrences and analysing logs.

     

    The tasks of sampling, counting and analysis are normally tackled with systems utilising relational database technology. Where it becomes technically challenging or expensive to meet the required scale or availability requirements with such solutions, NoSQL approaches may be employed. As part of this recent trend we see solutions such as OpenTSDB (sampling on HBase), Rainbird/Countandra (counting on Cassandra) and Hadoop (map/reduce for log analysis). Integration with monitoring/alerting systems (e.g. Ganglia, Nagios, RHQ) and with complex event processing systems (e.g. Esper, Drools Fusion) may also be considered.
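    The counting workload can be sketched as a time-bucketed rollup, the core idea behind the Cassandra-based counters mentioned above: events are aggregated into fixed-width buckets on ingest, so range queries read pre-aggregated counts rather than raw events. The one-minute bucket width and single resolution are simplifications; real systems keep several resolutions at once.

```java
import java.util.TreeMap;

// Sketch: events rolled up into fixed-width time buckets; range queries
// sum the pre-aggregated bucket counts.
public class TimeSeriesCounterSketch {
    private static final long BUCKET_MS = 60_000; // one-minute buckets
    private final TreeMap<Long, Long> buckets = new TreeMap<>();

    public void recordEvent(long timestampMs) {
        long bucket = (timestampMs / BUCKET_MS) * BUCKET_MS;
        buckets.merge(bucket, 1L, Long::sum);
    }

    public long countBetween(long fromMs, long toMs) {
        long total = 0;
        for (long c : buckets.subMap(fromMs, toMs).values()) total += c;
        return total;
    }

    public static void main(String[] args) {
        TimeSeriesCounterSketch c = new TimeSeriesCounterSketch();
        c.recordEvent(10_000);
        c.recordEvent(20_000);  // same minute
        c.recordEvent(70_000);  // next minute
        System.out.println(c.countBetween(0, 60_000));  // 2
        System.out.println(c.countBetween(0, 120_000)); // 3
    }
}
```

    In a distributed store the bucket key would become the row/partition key and the merge an atomic counter increment, which is what makes this scheme scale horizontally.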

     

    In this project you will work to create or extend one or more tools for handling time series data using NoSQL technologies, most likely Cassandra and/or Infinispan. Java based solutions will be preferred and candidates are expected to demonstrate a good grasp of Java programming. In addition, prior experience of a NoSQL system would be an advantage.

     

     

    Web traffic analytics in the cloud

     

    For many years Google Analytics has dominated the enterprise web traffic analysis market, by virtue of ease of use and competitive pricing. However, its capabilities are largely limited to pre-defined summary reports. As organisations become more aware of the potential of Big Data analytics and develop in-house skills in this area, bringing the web traffic analytics function back under their own control becomes both feasible and desirable.

     

    However, there is a lack of pre-packaged open source web log analytics software for them to use as a starting point. Existing solutions, e.g. Piwik, are feature-rich but based on relational database technology that does not scale sufficiently well for larger sites.

     

    In this project you will develop a web analytics package capable of operating at scale. This may include reusing selected parts of existing solutions as well as utilising cloud services to store and process data. Relevant skills include web application and web service engineering, web UI design, nosql data storage and map/reduce data processing.