3 Technical Operations and R&D
3.1 Data Processor Maintenance
3.1.1 Data Playback Units
The DPUs were finally decommissioned in 2007. Several were sold to Metrum; some still remain in place, mainly to preserve the airflow of the system as a whole. Their PR appeal should not be underestimated: visiting journalists, photographers and TV crews mostly insist on using the tape units, preferably with tapes spinning, as a backdrop. As more space is needed for other equipment, the DPUs are being removed and replaced by new racks.
3.1.2 Station Units
On the whole, the Station Units performed well. Individual components, however, continue to deteriorate, and the pool of available spare parts has dwindled considerably. A partial switch-over to Mark5B playback at the correlator would allow us to retire at least a few SUs, but this would require more stations to record in Mark5B format. As it is, too many projects still require 16 Mark5A playback units to allow the installation of Mark5Bs on a permanent basis.
3.1.3 Mark5 Units
The Mark5 units, both in their A and B personalities, performed reliably. Some minor hardware problems were repaired in-house. A spare Mark5B was purchased for code development and testing purposes, but was transferred to Westerbork quite soon after arrival, to help with the debugging and commissioning of TADUmax.
3.2 Data Processor Developments and Upgrades
3.2.1 Mark5
During the period 2007-2008 all Mark5 units were upgraded. New motherboards, CPUs, memory banks, hard disk drives and power supplies were installed, and several units were equipped with 10GE interfaces. Noticing large temperature differences within the Mark5 units, the JIVE operators devised a method to significantly improve airflow inside the Mark5 housing, thereby reducing the chance of disk pack overheating. This method was later adopted by Conduant and is now implemented in all new Mark5 units.
3.2.2 Archive
At the start of 2007, the total capacity of the JIVE data archive was 5.7 TByte. Fairly soon it became clear that this space would run out within roughly one year, so the decision was made to replace all 250 GB disks by 500 GB disks and to fill all available slots in the RAID cabinets. In order to use the total disk space as efficiently as possible, very large file systems were needed. This in turn required upgrades to both the operating system and the drivers, which proved problematic. Replacing the aging archive machine with a modern server solved these problems. The total capacity of the archive is now 15 TB, of which 6 TB is currently in use. This is expected to suffice for the next 2 to 3 years.
3.2.3 Re-circulation
Re-circulation, which enables one to optimize the use of a correlator through time-sharing its computing resources, was tested and verified and is now considered an operational capability of the EVN correlator.
3.2.4 Replacement of data acquisition platform
A new Solaris server equipped with a high-capacity RAID array has replaced the data acquisition machine after a series of correlation tests. This machine is fully interchangeable with the two correlator-control machines, providing extra resilience to the correlator system. With the installation of this machine, a re-circulation-enabled version of the correlator code was installed and, as a consequence, the whole system now runs permanently at 64 BOCFs. The installation of additional equipment caused the temperature in some racks to reach critical levels. This was solved partly through airflow modifications and partly through redistribution of hardware.
3.2.5 Mark5A to B upgrade
Although the Station Units have performed relatively well, spare parts are few and replacements unavailable. Upgrading the Mark5A units currently in use at JIVE to Mark5B would allow us to phase out the SUs.
Several Mark5A units were converted to B and hooked up to the correlator via Correlator Interface Boards and optical serial links, and the Haystack-developed Mark5B control software was modified for use with the EVN correlator. Testing, however, was seriously hampered by the initial lack of suitable 5B data. Using modified LBA data, a significant difference was found between correlation with A+ and with B playback. Extensive consultations with Haystack engineers identified a 1-second offset. As usual, fixing the problem proved a lot easier than finding it, and new tests showed no significant differences in the results from A, A+ and B playback.
Since these tests, Westerbork, Effelsberg and Yebes have started producing B-data on a regular basis, although these data are usually played back on A+ units at JIVE. As more stations switch to B-recording, it will become possible to permanently install B-units for playback.
3.2.6 PCInt
As a data reduction platform, the PCInt cluster plays a vital role in the correlation process. The control computer of the PCInt subnet also serves as a boot host for the correlator’s Linux-based single-board computers. Considering the age and the importance of this system, the decision was made to purchase a back-up control computer. Configuring this computer, however, turned out to be quite an effort: the O/S had to be upgraded and, moreover, a number of required server applications (e.g. DNS) had changed considerably.
3.2.7 Migration of HP-RT boot environment
The HP real-time computers, located inside the correlator racks, boot from an HP server, which until 2006 used to be the correlator control machine (an HP C240). A new boot environment was set up on a pair of (much newer) HP B2000 machines, with mirrored hard disks. This setup should provide protection against most types of hardware failure, and limit correlator downtime to a minimum.
3.3 Software Correlator
In early 2007, the SFXC software correlator produced its first fringes on astronomy data. This was a major milestone, showing that the C++ implementation of the original algorithm developed for the Huygens project was functional. After that, the correlator code underwent drastic changes. The code was modularized and further parallelized using MPI, which made it possible to distribute the correlator over machines within a cluster. A module to generate the delay model was added; this model is based on the same CALC10 code used by the hardware correlator.

The original configuration file format had some limitations that made it unsuitable for use in typical astronomy experiments, which observe multiple subbands and polarizations. It has been replaced with a more flexible (and simpler) format based on JSON that supplements information that is now read directly from the VEX file (see the sketch below).

The output format of the correlator also underwent a complete overhaul. Software was written to translate the output into an AIPS++ MeasurementSet, which allows the same tools used for analyzing and processing the output of the hardware correlator to be applied to SFXC output, and also makes conversion to FITS-IDI possible. Data correlated with SFXC were successfully loaded into AIPS, and a preliminary first image was produced. The module that decodes the input data has been extended to accept Mark5B and VLBA data as well as MkIV data, so that SFXC can now handle all data formats handled by the MkIV hardware correlator.
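To illustrate the kind of information such a control file carries, the following is a minimal Python sketch that writes a hypothetical JSON control file. The key names and values are illustrative assumptions, not the actual SFXC schema; anything not specified here would be taken directly from the VEX file.

```python
# Hypothetical sketch of an SFXC-style JSON control file; the key names and
# values below are assumptions for illustration, not the real SFXC schema.
import json

control = {
    "exper_name": "N07X1",                     # hypothetical experiment code
    "start": "2007y148d12h00m00s",             # correlation start/stop (VEX-style epochs)
    "stop":  "2007y148d12h10m00s",
    "stations": ["Ef", "Wb", "On", "Tr"],      # stations selected for correlation
    "channels": ["CH01", "CH02", "CH03"],      # subbands, as defined in the VEX file
    "number_channels": 1024,                   # spectral points per subband
    "integr_time": 2.0,                        # integration time (seconds)
    "delay_directory": "file:///data/delays",  # CALC10-based delay tables
    "output_file": "file:///data/n07x1.cor",
}

with open("n07x1.ctrl", "w") as f:
    json.dump(control, f, indent=2)            # everything else is read from the VEX file
```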
Yurii Pidopryhora, who joined the software correlator development team as a support scientist in early 2007, spent quite a bit of effort on verification of the results from the correlator. He has done a statistical analysis of the correlator output and made comparisons with the output of the existing hardware correlator. This work has uncovered several bugs, which since have been fixed. Verification of the results continues as the code is still changing while we try to optimize it and add new features to it.
Figure 1: Comparison of phase and amplitude of results from hardware correlator (left) with SFXC (right), in AIPS
The software correlator was first used to process FTP fringe tests in May 2007. The original plan had been to run it in parallel with the NICT software correlator that had been used for the fringe tests in previous years, but since the machine on which the NICT correlator was installed broke down, we had to rely solely on SFXC. It worked well enough that the support group never went back to using the NICT correlator. Quite a bit of effort was spent on making web pages that display the results of the fringe tests in a way that is convenient for the operators at the stations. These web pages are now generated automatically whenever the correlator runs.
Figure 2: Web page displaying recent FTP fringe tests results
For the EXPReS FABRIC JRA, which aims at running the software correlator in a standard Grid environment, several web services were developed. These web services interact with the workflow manager and VLBI Grid broker being developed by our collaborators at PSNC in Poznan, Poland. The web services implement some “domain-specific” knowledge like decoding VEX files and handling data transport. Integration of the various subsystems is still in progress.
Within the SCARIe project a collaboration was started with the AutoBAHN JRA of the GÉANT2 project. AutoBAHN is developing a system for on-demand allocation of dedicated circuits across the European research networks. This would benefit e-VLBI with a software correlator, since there is no longer necessarily a fixed location for the correlator. This collaboration resulted in a couple of demonstrations during which data were streamed from four sites spread around Europe (Ireland, Poland, Greece, Croatia) and the US (Boston) into the DAS-3 cluster in Amsterdam, correlating in real-time at 256 Mbit/s. These highly successful demonstrations took place at the GLIF workshop in Seattle (October 1-2, 2008) and at Supercomputing '08 in Austin (November 15-21).
The team working on the software correlator has seen many personnel changes. Mark Kettenis took over day-to-day management from Huib Jan van Langevelde in February 2007. Ruud Oerlemans left JIVE at the end of June 2007. His job was taken over by Huseyin Özdemir, who started in August 2007. Huseyin left at the end of March 2008, and Des Small (who was already working on parts of the EXPReS project) took over most of his duties. Nico Kruithof left at the end of June 2008. Aard Keimpema started working for the SCARIe project in September 2008, taking over where Nico left off.
3.4 e-VLBI
At the beginning of 2007, the EXPReS project was well underway and gaining momentum. Scientific e-VLBI runs were taking place on a regular basis, albeit at low data rates, and many soft- and hardware modifications, both at the correlator and at the stations, had led to a much-improved operational real-time system. However, some big problems still remained to be tackled, such as the establishment of reliable high-bandwidth data transfers within Europe, long-haul data transport from telescopes on other continents and the efficient use of available bandwidth.
3.4.1 Local network
As a first step, the complete network at JIVE was overhauled. An HP ProCurve 5412zl router was purchased to handle up to 16 1-Gbps lightpaths and one 10-Gbps IP-switched lambda (capped at 5 Gbps) from SURFnet, and all interconnects between SURFnet, Mark5 units and control- and test computers. A second, smaller, HP switch was installed to deal with all remaining correlator-related network traffic, removing dependencies on the ASTRON internal network. New monitoring software was installed, enabling the generation of graphs of the status and data throughput of the e-VLBI network.
3.4.2 International networks
Several stations in Europe were connected via dedicated 1-Gbps lightpaths across GÉANT2, and new lightpaths to China and Australia were set up through the good services of the GLIF collaboration. South America was connected via GÉANT2 and the EC-sponsored RedClara network. A second connection to China, via the EC-sponsored TEIN2 network, was also established by DANTE.
In 2008 Arecibo rejoined the e-EVN through a 512-Mbps shared connection to mainland USA (with the full bandwidth limited to certain timeslots) and a VLAN to JIVE via AtlanticWave and SURFnet. That year the Effelsberg Radio Telescope also came online, providing a second tremendous boost to the sensitivity of the e-EVN.
3.4.3 Software developments
Throughout, efforts continued on monitoring, control and post-processing software tools. Special emphasis was put on improving the robustness and the real-time behavior of the correlator control software, enabling rapid adjustment of correlator parameters and adaptive observing. Modifications to the correlator control code were implemented which made it possible to remove and add stations in the middle of correlator jobs, without having to restart the entire job. This removed one of the main causes of data loss during e-VLBI runs and resulted in a tremendous increase of productivity.
3.4.4 Data transport issues
When the first e-VLBI experiments started, all data transport with the Mark5A units was done through the TCP/IP protocol. This protocol is specifically designed to guarantee fairness on the Internet, by throttling back data throughput as soon as packet loss is detected (interpreted as congestion). After such an event, the data rate is slowly increased again and will eventually (in the absence of further packet loss) reach the original data throughput. However, the recovery time increases with RTT (round trip time). As a result, this protocol is particularly unsuitable for intercontinental real-time data transfers such as needed for global e-VLBI.
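A rough, back-of-the-envelope illustration can be given using the well-known Mathis et al. approximation for the steady-state throughput of standard (Reno-style) TCP:

```latex
% Mathis et al. approximation for Reno-style TCP:
\[
  \mathrm{throughput} \;\lesssim\; \frac{\mathrm{MSS}}{\mathrm{RTT}} \cdot \frac{1.22}{\sqrt{p}}
\]
```

With an assumed MSS of 1460 bytes and packet-loss rate p of 10^-4 (illustrative values, not measurements), this gives roughly 70 Mbps at a 20 ms intra-European RTT, but only about 5 Mbps at an assumed 300 ms intercontinental RTT.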
Two important e-VLBI demonstrations were planned for 2007: a demo at the Asian-Pacific Advanced Networking (APAN) conference in Xi’An, China, which would involve telescopes in China, Australia and Europe, and an e-VLBI run in which data from three Australian telescopes were to be correlated in real-time at the EVN correlator in Dwingeloo (an actual EXPReS deliverable). Very soon after the start of data transfer tests, it was realized that TCP would simply not do; data throughput from Shanghai never reached more than ~20 Mbps.
Figure 3: results of data transfer tests between Shanghai Observatory and JIVE using TCP
UDP, a connectionless protocol, should in theory perform much better (at the cost of the connectivity of other users) but was disabled in the Haystack-developed Mark5A control code. Re-enabling UDP transfer gave very poor results, and in the end the decision was made to completely re-write the e-VLBI related portion of the Mark5A control code at JIVE. This new code features rigorous thread control and options to handle out-of-order packets, to space the packets regularly (in order to prevent bursts of data), and to selectively drop packets at the sending side while padding the data stream at the receiving end with dummy data, optimizing the use of available bandwidth.
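The following is a minimal Python sketch of the two key ideas, packet pacing at the sender and dummy-data padding of lost packets at the receiver. It is not the actual JIVE Mark5A code; the packet size, data rate and header layout are assumptions.

```python
# Sketch only: paced, sequence-numbered UDP sending and zero-padded receiving.
import socket, struct, time

PAYLOAD = 8192          # assumed payload size per packet (bytes)
RATE_BPS = 512e6        # assumed target data rate (bits per second)

def send_stream(data, addr):
    """Send `data` as evenly spaced, sequence-numbered UDP packets."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    interval = PAYLOAD * 8 / RATE_BPS              # seconds between packets
    next_send = time.time()
    for seq, offset in enumerate(range(0, len(data), PAYLOAD)):
        packet = struct.pack("!Q", seq) + data[offset:offset + PAYLOAD]
        sock.sendto(packet, addr)
        next_send += interval                      # pace packets to avoid bursts
        time.sleep(max(0.0, next_send - time.time()))

def receive_stream(port, npackets):
    """Reassemble the stream, padding lost packets with dummy (zero) data."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    sock.settimeout(1.0)                           # give up after a 1-second gap
    chunks = [b"\x00" * PAYLOAD] * npackets        # pre-filled with dummy data
    for _ in range(npackets):
        try:
            packet, _ = sock.recvfrom(PAYLOAD + 8)
        except socket.timeout:
            break                                  # remaining slots stay zero-padded
        (seq,) = struct.unpack("!Q", packet[:8])
        if seq < npackets:                         # out-of-order packets land in place
            chunks[seq] = packet[8:].ljust(PAYLOAD, b"\x00")
    return b"".join(chunks)
```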
Another solution was developed for the Australian disk recording systems (LBADR), in preparation for the EXPReS Australia-JIVE demo (planned for the first week of October 2007). Data from the LBADR units were converted on the fly to Mark5B format, transferred using the Circuit TCP (CTCP) protocol (basically TCP without any congestion control at all) and received on Mark5A+ units at JIVE. This method was successfully used in the e-VLBI demo at the APAN conference in Xi'An, resulting, among other things, in fringes between Darnhall and Mopra (one of the longest VLBI baselines ever).
Figure 4: real-time fringes between Mopra and several EVN telescopes
During the Australia-JIVE demo, data from ATCA, Mopra and Parkes were transferred via three dedicated 1-Gbps lightpaths to JIVE and correlated in real-time. This time the UDP protocol was used, and 512 Mbps per telescope was sustained for 12 hours with hardly any packet loss at all. As with the APAN demo, this demo also generated quite a lot of public attention.
A number of further developments followed. Ensuring that only packets containing data are dropped, while leaving data headers intact, greatly improved the behavior of the correlator during high-data-rate e-VLBI. Although very useful, packet dropping does increase the noise right across the observed bandwidth, and channel dropping, in which only specific subbands are dropped at the stations, would in most cases be preferable. This method was shown to work on local machines, but has not been used in production because of the high CPU load involved. Implementation will follow when all e-VLBI stations have upgraded their systems with new CPUs and SMP-enabled Linux kernels, to make full use of the available CPU power. Related to this, changes were made to the correlator control software to allow different configurations at different stations, providing an additional tool to adjust data rates. Finally, simultaneous recording and transmitting of data at the station side was enabled, but has not been used operationally yet. Simultaneous playback and recording at the correlator cannot be done with Mark5A, but should be possible with Mark5B.
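As a rough indication of the cost of packet dropping: if a fraction f of one station's samples is replaced by zero padding, the signal-to-noise ratio on that station's baselines scales, to first order, roughly as

```latex
\[
  \mathrm{SNR}(f) \;\approx\; \sqrt{1-f}\;\cdot\;\mathrm{SNR}(0),
\]
```

so dropping, say, 5% of the packets costs only about 2.5% in sensitivity, but this loss is spread over every subband, whereas channel dropping confines it to the discarded subbands.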
The JIVE-developed Mark5 control code was further adapted to work with Mark5B recording units. However, the Domino software supplied by Haystack for Mark5B playback at the correlator came without any support for e-VLBI. After Haystack engineers had added this functionality, further modifications were made at JIVE. In spite of successful tests with local units, tests with real data have so far failed. Progress has been slow, mostly because of the small number of stations with both Mark5B units and sufficient connectivity. This situation should, however, improve in the coming year.
3.4.5 Adaptive observing
A first test of dynamic scheduling was done on August 28, 2008: during an observation, the schedule was changed at JIVE, and the new schedule file was merged with the old one, distributed to the stations, DRUDG'ed locally (via ssh from JIVE) and run at the stations. The changes were made at Torun and Westerbork, with Jodrell Bank staying on the original schedule, and, as planned, fringes between Torun and Westerbork reappeared after the change. No new software had to be installed at the stations; the commands were executed at the stations using ssh in single-command mode from scripts run at JIVE. In the future, this method could prove particularly important for rapid-response observations of transient sources.
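A minimal Python sketch of this approach is shown below. The host names, paths and remote command line are hypothetical and only illustrate the single-command ssh mechanism described above, not the actual JIVE scripts.

```python
# Hypothetical sketch of distributing a revised schedule and running DRUDG
# at the stations via ssh in single-command mode; host names, paths and the
# exact remote command line are assumptions, not the real JIVE scripts.
import subprocess

STATIONS = {"Tr": "oper@fs.torun.example", "Wb": "oper@fs.wsrt.example"}
SCHEDULE = "n08d1.vex"      # merged schedule file to be distributed

for code, host in STATIONS.items():
    # copy the new schedule file to the station's field-system computer
    subprocess.run(["scp", SCHEDULE, f"{host}:/usr2/sched/"], check=True)
    # run drudg remotely to regenerate the station's SNAP/procedure files
    remote_cmd = f"cd /usr2/sched && drudg {SCHEDULE} < drudg_{code}.inp"
    subprocess.run(["ssh", host, remote_cmd], check=True)
```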
3.4.6 Merlincast
On July 22, 2008 a special test was done involving the MERLIN telescopes at Cambridge, Darnhall and Jodrell Bank (Mk2). In the current MERLIN network, the 'out-stations' are connected to Jodrell Bank by microwave links that have about 128 Mbps throughput. For this test, the links from both Darnhall and Cambridge were connected to the VLBA terminal. The VLBA terminal has 4 IF inputs, so each IF received data for one polarization from either Darnhall or Cambridge. The IF sampled data from both telescopes were then run through the formatter and Mark5 and transmitted to JIVE. At JIVE, the 'port monitoring' functionality of the central JIVE switch/router was used to 'snoop' on all the networking traffic towards one Mark5 and send duplicates to a second Mark5. With this setup, fringes between all three stations were achieved. This experiment was repeated on the 9th of September, this time using IP Multicast to perform the packet duplication without having to undertake major networking changes at JIVE. This resulted in the first real-time fringes to the Knockin station at MERLIN. This technique, now dubbed ‘Merlincast’, has the potential to significantly improve the sensitivity of the e-EVN to larger scale structures.
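As an illustration of the multicast mechanism (not the actual JIVE configuration; the group address and port are assumptions), a receiving Mark5-side process only needs to join the multicast group, after which the network duplicates each packet to every subscriber:

```python
# Minimal sketch of joining an IP multicast group and receiving a duplicated
# ('Merlincast') UDP data stream; group address and port are assumptions.
import socket, struct

GROUP, PORT = "239.1.2.3", 2630          # assumed multicast group and port

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))

# join the group on all interfaces; the switch/router then delivers a copy
# of every packet to each subscribed Mark5 unit
mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

while True:
    packet, sender = sock.recvfrom(9000)  # hand each packet to the local playback process
```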
3.4.7 Towards true 1-Gbps e-VLBI
Although the use of packet dropping in combination with UDP enables one to (nearly) fill available links to their limit, the full 1024 Mbps of e-VLBI traffic (plus overhead in the form of headers) will simply not fit on a standard 1-Gbps (= 1000 Mbps) connection. Several ways around this problem were investigated.
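A short worked example makes this concrete (the packet and header sizes below are assumed, typical values): even before framing, 1024 Mbps exceeds the 1000 Mbps line rate, and per-packet overhead only makes matters worse.

```latex
% Assuming 8000-byte UDP payloads, an 8-byte application sequence number and
% about 66 bytes of UDP/IP/Ethernet framing (including preamble and gap):
\[
  1024\ \mathrm{Mbps} \times \frac{8000 + 8 + 66}{8000} \;\approx\; 1033\ \mathrm{Mbps} \;>\; 1000\ \mathrm{Mbps}.
\]
```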
The Westerbork Radio Telescope is connected to Dwingeloo via dark fiber. Redundant CWDM equipment (kindly provided by the LOFAR group) was installed and equipped with a number of colors, two of which were reserved for e-VLBI traffic. In order to reach 1024 Mbps, a single data stream was divided in round-robin fashion over two independent 1-Gbps lightpaths and recombined at the receiving end. Tests showed that transfers of 1500 Mbps could easily be sustained in this way. The same method was applied to the dual 1-Gbps lightpath connections to the UK and, although shown to work in principle, its use in production awaits a motherboard upgrade of one of the Mark5 units at Jodrell Bank.
Figure 5: network setup between Westerbork and JIVE
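A minimal Python sketch of the round-robin splitting described above is given below. It is not the actual Mark5 code: the addresses are placeholders, and it assumes that routing is configured so that each source address maps onto its own lightpath. The receiver would restore the original order using the sequence numbers, exactly as in the UDP sketch of section 3.4.4.

```python
# Hypothetical sketch of splitting one data stream round-robin over two
# 1-Gbps lightpaths; addresses are placeholders and routing is assumed to
# map each source address onto its own lightpath.
import socket, struct, itertools

LINKS = ["192.168.1.10", "192.168.2.10"]     # assumed local address per lightpath
DEST  = ("192.168.10.1", 2630)               # assumed receiver at JIVE

socks = []
for addr in LINKS:
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind((addr, 0))                        # tie the socket to one interface
    socks.append(s)

def send_round_robin(packets):
    """Alternate sequence-numbered packets over the two lightpaths."""
    for seq, (payload, sock) in enumerate(zip(packets, itertools.cycle(socks))):
        sock.sendto(struct.pack("!Q", seq) + payload, DEST)
```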
The Effelsberg Radio Telescope came online in 2008, through a dedicated fiber connecting the MPIfR in Bonn to the site. To accommodate both e-VLBI traffic and data transfers from their new e-LOFAR station, a 10-Gbps connection was established via Amsterdam to both Dwingeloo and Groningen. The Effelsberg Mark5 unit is equipped with two 1-GE interfaces, and the data stream is divided in a similar way as that used for the Westerbork connection. However, the two data streams are recombined on the local switch, and sent as a single data stream through a VLAN on the 10-Gbps link.
Figure 6: data throughput during first e-VLBI observations with Effelsberg
Onsala Radio Observatory was connected at 10 Gbps, to allow for future e-LOFAR data transfers, and to enable real-time 4-Gbps transfers from Onsala to the e-MERLIN correlator at Jodrell Bank (part of the EXPReS Joint Research Activity FABRIC). Torun Radio Telescope was connected at 10 Gbps to the Poznan supercomputing center; both Onsala and Torun Mark5 units are equipped with 10GE interfaces.
This synergy with other projects led to the milestone observations on the 19th of November 2008, during which Westerbork, Effelsberg and Onsala participated at a full 1024 Mbps.
3.4.8 Tests and operations
Slots of 24 hours were reserved at 4 to 6-week intervals for e-VLBI science. The four hours preceding each session were earmarked for setup and testing. Apart from this, many tests took place, depending on the availability of stations and the particular urgency of the test in question. Operational reliability and ease increased steadily throughout the last two years, and many successful science observations were conducted.
3.4.9 Demonstrations
Live demonstrations continued to be an important element of the outreach effort of EXPReS. Although they sometimes put a considerable strain on the operational network and can be quite disruptive, demonstrations are extremely useful in providing a focal point and speeding up developments.
The first demonstration of 2007, at the APAN conference in Xi’An, China, involved telescopes in China, Australia and Europe. Although fringes had been obtained in the past (at very modest data rates) between Arecibo and the European EVN telescopes, the distances in this demo would be considerably longer. What’s more, the LBA uses a completely different data acquisition system and data format. As mentioned in section 3.4.4, a large and diverse number of problems were solved, and after the track of the Shanghai telescope broke and was repaired (one week before the start of the conference!), the actual demo went without a hitch. For this demo we obtained access to the EC-sponsored trans-Siberian TEIN2 network, through the services of both Chinese research networks, CSTNET and CERNET. Data from the Australian telescopes were transferred via a dedicated lightpath connection provided by AARNET, CANARIE and SURFnet, and via the ‘normal’ Internet (which failed completely during the demo).
This was followed by the EXPReS-Australia demo, during which data from three telescopes were transferred, at 512 Mbps each, via three lightpaths to JIVE. This demo ran for more than 12 hours and nicely illustrated how VLBI may look in the future, connecting telescopes and correlators on opposite sides of the planet in real-time.
Figure 7: data transfer during EXPReS-Oz demo
In an unexpected development Hartebeesthoek became the next station to join the e-EVN. A 1-Gbps connection between Hh and the nascent South African NREN, SANReN, in Johannesburg, and from there at 64 Mbps via London to JIVE, became available in May 2008. Hh then participated in two very successful demos, of which one was rather ad-hoc, organized for the visit of a high-ranking EC delegation to the Hh telescope site.
The second demo took place at the high-profile TERENA 2008 conference, in Bruges, Belgium, where JIVE director Huib van Langevelde was the keynote speaker at the closing plenary meeting. This demo produced fringes between TIGO, Hh, Ar, Ef, Wb, Mc and On, effectively a 4-continent correlation, with the real-time results displayed by van Langevelde in his presentation.
Figure 8: e-VLBI demo display at TERENA conference
A smaller demo was run later in the year, during a presentation by Szomoru at the GLIF conference in Seattle, USA. By that time JIVE and the EVN had become sufficiently experienced to tackle a far more ambitious project, and towards the end of 2008 preparations started for a 24-hour real-time tracking of a single source, a truly global effort involving many non-EVN telescopes. This was to feature at the opening of the International Year of Astronomy, in Paris, in January 2009, and as such it will be reported on in the next biennial JIVE report.
3.5 EVN-NREN, e-VLBI workshops
In September 2007 the yearly e-VLBI workshop was hosted by the MPIfR in Bonn. With 58 participants it was very well attended, and it covered a large number of technical topics. A half-day EVN-NREN meeting, and a one-day meeting on the EXPReS Joint Research Activity FABRIC followed this workshop.
Shanghai Observatory hosted the 7th international e-VLBI workshop in June 2008. Again, attendance was high, and one of the things that became clear through the many excellent presentations was that e-VLBI in Asia is in full development. This workshop featured a live e-VLBI demo involving Shanghai, Kashima and the Australian LBA, with the data being correlated on the Australian DIFX software correlator, and a live 8-Gbps data transfer via the ‘normal’ (non-lightpath) networks of the Scandinavian NRENs. As a result of an open discussion on data formats, the decision was made to form a task force, led by Alan Whitney, to determine a standard VLBI data format, with the aim of enabling seamless integration of different telescope networks, data acquisition platforms and correlators.