Implementing a Synthesizer in an FPGA

Thank you for your feedback, guys. It is very much appreciated.

I posted the latest update video demonstrating the envelope generator’s features.

Please note that something strange happened to the sound during the long attack demonstration, when the video got encoded. In real life, the attack sound is smooth.

One thing I didn’t mention in the video is that I wasted quite a bit of time chasing a bug that is actually in the S/PDIF DAC (I really need to change my DAC).

In some cases, the beginning of the attack looked like it was increasing exponentially instead of asymptotically as expected.

It turns out my DAC seems to “go to sleep” when the encoded signal is 0 (maybe any DC value?), even though the S/PDIF signal itself is still active.
It takes the DAC a couple of milliseconds to get back to normal.

I confirmed using SignalTap (a part of the Intel FPGA software suite that lets us probe signals inside the FPGA) that the envelope and modulated signal had the correct shape even when the oscilloscope was showing otherwise.

Not sure how I missed this thread but damn this is good stuff.

This is awesome, are you going to add polyphony? Please keep posting your mini updates!

Thank you.
Yes, I am planning my design to enable adding polyphony in the future, but right now, I am still working on adding the second oscillator.
Baby steps.

The latest update is up on YouTube.

What takes time at this point is not so much implementing the new features.
The second oscillator is a cut and paste of the first one, and the mixer is just a couple of multipliers and an adder.
It’s the planning and architecting that take time.
It’s a bunch of little decisions.
How many bits of precision for the pulse width modulation? For the mixer?
What will control what?
Which oscillator syncs which one?
How many envelope generators?
How many LFOs?
What can LFOs and Envelopes modulate?
Linear or exponential frequency encoding? (The equivalent of Hz/Volt or Volt/Octave, respectively. Each method has its own advantages and disadvantages.)
Etc., etc.
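
To make the linear versus exponential question a bit more concrete, here is a small Python sketch. It is not the synth's code, just an illustration. It assumes the usual MIDI tuning (note 69 = A4 = 440 Hz), and the function names are made up for this example.

```python
def midi_note_to_hz(note: float) -> float:
    """Exponential encoding (the Volt/Octave idea): adding 12 doubles the frequency."""
    return 440.0 * 2.0 ** ((note - 69) / 12)


def linear_control_to_hz(value: float, hz_per_unit: float = 1.0) -> float:
    """Linear encoding (the Hz/Volt idea): frequency is proportional to the control value."""
    return value * hz_per_unit


if __name__ == "__main__":
    for note in (0, 12, 69, 81):
        print(note, round(midi_note_to_hz(note), 3), "Hz")
```

With the exponential encoding, transposing by an octave is a simple addition on the control value. With the linear encoding, it is a fixed offset in Hz that becomes a simple addition.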

But it is coming along and getting more and more musical with each added feature.

So far I have used up just over half of the resources of the EP4CE6 FPGA that I started with.
I expect I will soon run out of multipliers: several of the potentiometers or VCAs you see in an analog synth are each emulated using at least one multiplier in the FPGA, and there are only 15 multipliers in the small device that I am currently using.
So far I have used 8 of the 15 multipliers: 4 in the filter (one in each of the four stages), 1 in the envelope generator, 2 in the mixer and 1 to control the second oscillator’s frequency offset.

When I add the LFO, it will be expensive in multipliers because you need one multiplier for each thing that the LFO can modulate, to control the depth of the modulation.
Same thing for everything you want to modulate with an envelope.
So I will soon be migrating to the EP4CE15, which has almost four times as many multipliers.
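
The multiplier budget is easy to tally. This is only a bookkeeping sketch in Python: the counts come from this post, while the list of LFO destinations is a made-up example, not the actual plan.

```python
TOTAL_MULTIPLIERS = 15  # hardware multipliers in the small device (from the post)

used = {
    "filter stages": 4,
    "envelope generator": 1,
    "mixer": 2,
    "oscillator 2 frequency offset": 1,
}

# Hypothetical modulation targets: each one needs its own depth multiplier.
lfo_destinations = ["osc 1 pitch", "osc 2 pitch", "pulse width", "filter cutoff"]


def multipliers_left(total: int, used_counts: dict, new_destinations: list) -> int:
    """Multipliers remaining after adding one depth multiplier per destination."""
    return total - sum(used_counts.values()) - len(new_destinations)


if __name__ == "__main__":
    print(sum(used.values()), "multipliers used so far")
    print(multipliers_left(TOTAL_MULTIPLIERS, used, lfo_destinations), "left after this LFO")
```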

This is coming along nicely, but it also makes me more and more curious as to what FPGA code looks like. Will you please, please, please show us a bit of it? I would be satisfied to just see snippets of it.

What does an oscillator look like?
What does the controller code look like (a knob to set a pitch / frequency etc.)?
And concerning the architecture of the device: how do you time all operations so that they are executed with a certain timing related to the underlying sampling frequency of the device? Is this interrupt based or based on some other principle?

This would be interesting to know just to get some insight into what FPGA code looks like as opposed to Von Neumann architecture based approaches in e.g. C/C++.

Sorry for taking so long to reply, I was trying to find a good way to answer without creating a whole new series of videos or writing a book, but I failed and ended up writing a book and some new code for demonstration purposes.
And that still only ends up scratching the surface of the questions that you asked.

So what does it all look like?
At the moment, it’s a mess.
A lot of what I have done so far is hacked-together proofs of concept.
But everything is still written with an eye to the future, so traces of my “secret sauce”, which I am not yet ready to reveal to the world, are woven throughout every module. (The building blocks are actually called modules in Verilog.)

I was looking for parts that I could show you, but either they would be too trivial to really explain anything or too convoluted and spaghetti-like to clearly explain the underlying concepts.

So I took some time to write a simple, but representative oscillator module and heavily commented it to explain some of the syntactic quirks of Verilog and the underlying concepts.

And now, to answer some of your questions a little bit more precisely while staying unfortunately too abstract:

Everything in an FPGA is described with HDLs (Hardware Description Languages). The two most used HDLs are Verilog and VHDL. The former is what I, and most of North America except the military, use, and the latter is what most of Europe uses. I’m not sure of the rest of the world’s preferences. But like EMACS over vi, Verilog (or specifically its descendant SystemVerilog) is clearly superior. :innocent:

So before you can describe an oscillator, you have to imagine it.
The goal of an oscillator is to produce a periodically varying value, and since we are in the digital domain, this happens in a discrete manner: a new value has to be produced regularly at a given sampling rate.
The oscillator has two parts, the first of which is a phase value that increments for every sample and loops around back to zero at the end of each period.
The easiest way to implement that in digital logic is with a counter. You are probably familiar with the 4-bit counters available in the 4000 or 74xx series of digital logic chips. They count up from 0 to 15 and back to zero.
So just put a clock on a counter and you have the first half of an oscillator. In the case of the 4-bit counter, it has a fixed frequency which is one sixteenth of the sampling frequency: not too useful, but a start.

The second part of an oscillator is what forms the actual waveform. To each phase value corresponds a waveform value. In the case of a sawtooth ramping up, the value of the phase itself can serve as the waveform output. For a sine wave, one would typically have one cycle of a sine wave stored in a memory and use that as a lookup table. (Refinements could include storing only a quarter of the period to save memory and using the waveform’s symmetry to compute the other three quarters of the period. Linear interpolation could also be used to limit the size of the lookup table.)
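
The quarter-period trick can be sketched in a few lines of Python. This is a software model for illustration only, not code from the synth: the table size, the half-step sampling offset and all the names are assumptions chosen to keep the example small.

```python
import math

TABLE_BITS = 6                   # hypothetical: 64 entries cover one quarter period
TABLE_SIZE = 1 << TABLE_BITS
PHASE_BITS = TABLE_BITS + 2      # the two top bits of the phase select the quadrant

# Quarter-period table, sampled at half-step offsets so the mirrored reads line up.
QUARTER = [math.sin((i + 0.5) * (math.pi / 2) / TABLE_SIZE) for i in range(TABLE_SIZE)]


def sine_from_phase(phase: int) -> float:
    """Rebuild a full sine period from the quarter table using symmetry."""
    phase &= (1 << PHASE_BITS) - 1
    quadrant = phase >> TABLE_BITS
    index = phase & (TABLE_SIZE - 1)
    if quadrant & 1:             # quadrants 1 and 3 read the table backwards
        index = TABLE_SIZE - 1 - index
    value = QUARTER[index]
    return -value if quadrant & 2 else value   # quadrants 2 and 3 are negative
```

In hardware, the backwards read is just an inversion of the index bits and the sign flip is a negation, so the memory shrinks by a factor of four for very little extra logic.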

With a 4-bit counter, the size of the lookup table would not be a problem, but in real life our phase counter would have to be significantly larger.
How large?
Let’s see: with a 48kHz sampling rate, if we want to generate the lowest MIDI note at around 8 Hz (round numbers used for simplicity), our counter needs to be able to count to at least 6000 (48000/8), so it needs at least 13 bits.
But how do we get the frequencies above 8Hz? The idea is to use an adder instead of a counter. That way the phase can increase by any given number at each sample. If we want to generate a note one octave above that 8Hz, we simply increment the phase by 2 every sample and it will take 3000 samples for our phase to roll over and start a new period (still pretending, to keep the numbers round, that the 13-bit phase wraps at 6000 rather than at 2^13 = 8192). 48kHz/3000 = 16Hz

Right away we notice that 13 bits doesn’t give us enough precision: increment by 1 and we get one note, increment by 2 and we get the note one octave higher. So how do we get the notes between those two? We simply increase the width of our phase: let’s make it 20 bits wide instead of 13. To get the 8Hz note we now increment by 128 at every sample, and to get the octave above that at 16Hz we increment by 256. We now have 128 values to choose from to get the 11 notes in between those two. It might still not be quite enough to get within a few cents of each note, but remember this is the lowest octave supported by MIDI; it is between 8Hz and 16Hz and is below the hearing range of most humans.
With 20 bits we have plenty of precision for the audible notes: already at the next octave we have 256 different values to get close to the 12 notes.
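
For anyone who wants to play with the numbers, here is a minimal Python sketch of the increment arithmetic. It assumes the phase accumulator wraps at exactly 2**PHASE_WIDTH, so the exact values differ a little from the round figures used in the text (an increment of 128 on a 20-bit phase gives about 5.86 Hz rather than 8 Hz). The function names are made up for this example.

```python
import math

SAMPLE_RATE = 48_000
PHASE_WIDTH = 20


def increment_for(freq_hz: float, phase_width: int = PHASE_WIDTH) -> int:
    """Phase increment (rounded to an integer) that comes closest to freq_hz."""
    return round(freq_hz * (1 << phase_width) / SAMPLE_RATE)


def frequency_of(increment: int, phase_width: int = PHASE_WIDTH) -> float:
    """Frequency actually produced by a given integer increment."""
    return increment * SAMPLE_RATE / (1 << phase_width)


def cents_error(target_hz: float, phase_width: int = PHASE_WIDTH) -> float:
    """Tuning error, in cents, caused by rounding the increment."""
    actual = frequency_of(increment_for(target_hz, phase_width), phase_width)
    return 1200 * math.log2(actual / target_hz)


if __name__ == "__main__":
    for width in (13, 20, 24):
        print(width, "bits:", round(cents_error(8.0, width), 2), "cents off at 8 Hz")
```

The frequency resolution is SAMPLE_RATE / 2**PHASE_WIDTH, so every extra bit of phase halves the smallest frequency step.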

So what does this look like in SystemVerilog?
(Sorry, I don’t know how to get Discourse to do syntax highlighting for Verilog.)

// osc.sv
// This module is the back end of the oscillator.
// It performs the phase incrementation and creates simple waveforms that do not require lookup tables.
// For flexibility, the value of the phase increment is computed elsewhere and provided as an input to this module.
// Output waveforms: sawtooth and square wave
// Outputs are unsigned values.
// For simplicity the output samples have the same width as the phase.

module osc
  #(
    // Modules can be parameterized for flexibility, and to enhance re-usability.
    // For example if we wanted to turn this into an LFO we could just increase the PHASE_WIDTH parameter to achieve lower frequencies.
    parameter PHASE_WIDTH = 20
    )
   (
    // Here we declare the module's input and outputs.
    // Inputs are typically just wires driven by other modules higher up in the hierarchy.
    
    // Most digital logic is synchronous, meaning all changes happen on the rising edge of the system clock.
    input wire                   clk,
    
    // It is good practice to have a reset signal on all registers so we know what state the circuit starts in.
    // The Altera FPGA hardware is optimized for asynchronous active low resets.
    // (Circuit goes into reset when the rst_n pin goes low.)
    // It is customary to indicate active low signals by appending _n  
    input wire                   rst_n,
                                        
    // Buses are declared with this syntax [a:b] meaning that the bits will be numbered from b to a.
    // It is customary to put the largest number (most significant bit) on the left.
    // Most of the time, the bits in a bus that is N bits wide will be numbered from 0 to N-1
    // We read [N-1:0] as "N-1 down to 0".
    input wire [PHASE_WIDTH-1:0] increment_in, // This value is added to the phase accumulator at every clock cycle.

    // It is good practice to have registers as outputs. It helps avoid several potential timing issues.
    // reg is the key word to declare registers.
    output reg [PHASE_WIDTH-1:0] ramp_up_out, // The ramp waveform.
    output reg [PHASE_WIDTH-1:0] square_out    // The square waveform.
    );
   
   // Here we declare registers to store the state that needs to be kept between clock cycles.
   // In this case we only have our phase register.
   
   // The phase is an unsigned number incremented for each sample and saved. 
   reg [PHASE_WIDTH-1:0]         phase;

   // Here begins the main infinite loop
   // This "always" block statement says that what is inside of it will happen at every rising edge of the clock ( @ posedge clk )
   // or at any falling edge of the reset ( @ negedge rst_n ). (Asynchronous, active low reset.)
   // It is good to have an asynchronous reset so that at power up, the circuit can be kept in reset until the clock stabilizes.
   always_ff @( posedge clk or negedge rst_n ) begin

      if ( !rst_n ) begin
      
         // Active low reset asserted.
         // Clear all our outputs and the phase counter during reset so we start from a known state.
         phase       <=  'd0; // The left arrow "<=" is the "non blocking" assignment which we use inside always blocks.
         ramp_up_out <=  'd0; // 'd means a decimal value, 'h is for hexadecimal and 'b for binary, in hardware we often need to express precisely the value of every bit.
         square_out  <=  'd0; // You can also specify the width of the number in bits by adding a number before the apostrophe e.g. 3'b010 for a three bit wide number two
                              // When no size is specified but the apostrophe is present, the size is assumed to be the same as the left hand side.
                              // If the apostrophe and base is not present, the number is assumed to be a 32-bit decimal number.
                              // It is often necessary to exactly specify the size of numbers, and the compiler will issue warnings when it is not specified.
                              // But it can also be dangerous to use explicit sizes. For example, say you first specify a number as 3'd7 but later realize it should be 8.
                              // If you just change it to 3'd8, you get 0 instead, because 8 does not fit in 3 bits. It could take a while to find that bug because your brain will read 3'd8 as 8.
                              // You will get a compiler warning, but if you are not careful you will get tons of warnings (several thousands in large projects) 
                              // so you may not notice the new warning.
                              // That is why I try hard to eliminate as many warnings as possible and know the current count of "acceptable" or "unavoidable" warnings,
                              // so I'll notice when there is a new one. But it isn't always easy to keep the warning count low.
                              // For example, to make sure this very simple file was free of syntax errors, I threw it in Quartus and compiled it and got 12 warnings, 6 of which "critical".
                              // Don't worry, this file is clean, the warnings are just because other parts are missing to make a project out of this file.
                              // E.g. you have to define your clock speed, which pins of the chip are connected to the inputs and outputs, 
                              // and funnily enough how many processor cores to use while compiling (5 of the 6 non-critical warnings are about this).
                              // Compilation of large projects can take several hours, you may not want to render your computer unusable (or have it melt) because the compiler is using all the cores at 100%.

      end else begin

         // We are now out of reset and operating normally.
         // We do the following at every rising edge of clk.
         
         // Add the increment to the phase
         phase         <= phase + increment_in; // This will automatically roll over to 0 when the count exceeds the size of the phase register.

         // "Compute" the outputs waveforms from the phase

         ramp_up_out   <= phase; // It's just the phase as-is, really.

         // Replicate the most significant bit for all bits so we get 0 for the first half of the period and 2**PHASE_WIDTH-1 for the second half.
         // This syntax: {a {b} } means concatenate b, a times.
         // phase[PHASE_WIDTH-1] is the most significant bit of phase. 
         square_out    <= { PHASE_WIDTH { phase[PHASE_WIDTH-1] } };
         
      end // else: !if( !rst_n )
      
   end // always_ff @
   
endmodule : osc // Useful in large files with multiple modules, but normally you only put one module per file.

// Again I hope I didn't write all this for nothing and that someone read the whole thing.

Thanks for this very clear description. I’m not familiar with the hardware or with HDL but this gives a crystal clear picture of the considerations involved in designing an oscillator on an FPGA.

Thank you for taking the time to write your post. The example code is enlightening!

I take it if you want N oscillators you basically use the module N-times and the FPGA-hardware will be configured to have N of them that listen to the same clock, making the oscillators work in parallel?

One more question: In my software vocoder I basically calculate one output sample for each cycle of the sampling frequency. This is done by sending one input sample through all 31 carrier filters and one input sample through all 31 modulator filters, their envelope followers, quite a few multipliers in the matrix, summing functions, etc. This has to occur before the next input sample arrives from the AD converter. Basically all calculations are done one after another; there is no parallelization being used (as there is only 1 processor core). I guess an FPGA can start a set of calculations based on an interrupt from an AD converter and has to do the same computations (but quite a few of them in parallel) in the time it has before the next interrupt occurs. Because the 62 filters and 31 envelope followers etc. can operate in parallel, the clock frequency of the FPGA need not be as high as that of the CPU in a classical Von Neumann implementation, however. Would it be possible to implement something like that as I imagined, or are there best practices that would take a different route?

You did not write it for nothing and thank you.

Yes, that is one of the fun aspects of FPGAs: if you want multiple copies of a specific piece of hardware, be it an oscillator, a UART, an I2C interface, a PWM output, etc., you just copy it as many times as you need, and all the copies run in parallel.
Of course, at some point you will run out of resources in the FPGA. Like I mentioned before, in the low end FPGA I started this project with, there are only 15 multipliers, so you have to use them wisely.

There is, however, another important way by which FPGAs can do multiple things at the same time: it’s called pipelining.

Let’s go back to the oscillator example above, where the phase is incremented and the output waveforms are created:

// Step 1
phase       <= phase + increment_in;
// Step 2
ramp_up_out <= phase;
square_out  <= { PHASE_WIDTH { phase[PHASE_WIDTH-1] } };

What is probably not immediately obvious if you are not familiar with Verilog is that those three lines of code look like they execute one after the other, and in a sense they do, but they also all execute simultaneously.
By that, I mean that on the first clock cycle, the phase is incremented, and on the second clock cycle the ramp and square wave output registers are written with the new values, while the phase is also being incremented again.

So it takes two cycles to compute the new outputs, but new outputs are still available every cycle because a new computation is started every cycle. Like in a pipeline where new liquid enters at one end at every second, and new (old) liquid comes out at the other end.
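
If it helps, here is a small Python model of what the three non-blocking assignments do on each clock edge. It is only a software illustration of the register behavior, not generated from the Verilog: the 4-bit width and the function names are assumptions picked to keep the trace short.

```python
PHASE_WIDTH = 4                      # tiny width so the wrap-around is easy to see
MASK = (1 << PHASE_WIDTH) - 1


def clock_edge(state: dict, increment_in: int) -> dict:
    """One rising clock edge: every right-hand side reads the OLD register values."""
    old_phase = state["phase"]
    msb = (old_phase >> (PHASE_WIDTH - 1)) & 1
    return {
        "phase": (old_phase + increment_in) & MASK,   # step 1
        "ramp_up_out": old_phase,                     # step 2, sees the previous phase
        "square_out": MASK if msb else 0,             # step 2, MSB replicated on all bits
    }


def run(cycles: int, increment_in: int) -> list:
    """Return (phase, ramp_up_out, square_out) after each clock edge, starting from reset."""
    state = {"phase": 0, "ramp_up_out": 0, "square_out": 0}
    trace = []
    for _ in range(cycles):
        state = clock_edge(state, increment_in)
        trace.append((state["phase"], state["ramp_up_out"], state["square_out"]))
    return trace


if __name__ == "__main__":
    for row in run(6, 3):
        print(row)
```

In the trace, ramp_up_out always shows the value phase had one cycle earlier, yet a new output still appears on every cycle.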

The idea here is that even though it looks like a normal programming language, instead of having a single processor go over each line one after the other, dedicated hardware is laid down to execute each line, and once that hardware is there, it is available to do its thing at every cycle.

That is the silver lining: you may have limited resources, like only 15 multipliers, but you can use them all at every clock cycle.

For your 31 filters, you could maybe implement a single programmable filter and feed it with the 31 different control values (center frequency and whatnot) one after the other at every cycle, so in 31 cycles plus however many cycles your algorithm takes to process one sample, you get your results for the 31 filters. (That also means that each additional filter only adds one clock cycle, so a 63 filter thingy might be conceivable.) And by clock cycle here, I mean the fast system clock, not the sampling rate.

A reasonably well designed circuit should be able to run at 50MHz or more in a low end FPGA, so for a 48kHz sampling frequency, you have about a thousand cycles to compute the next sample. And for every additional sample period of latency you are willing to accept, you get an additional thousand cycles of computation. It doesn’t matter if it takes more than one sampling period because you are still producing one new sample every sampling period.

Unless, and that is a big caveat, there is feedback in your algorithm, i.e. an output sample depends on the immediately preceding output sample. In that case, you need to finish your computation in one sample period. But that is still a thousand cycles and you’ll probably run out of resources in a low end FPGA before you can execute a thousand steps.
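
The cycle budget can be checked with a few lines of Python. This is just the arithmetic from the paragraphs above: the 50 MHz and 48 kHz figures come from the text, while the pipeline depth of 5 is an arbitrary placeholder, since the real depth depends on the filter design.

```python
SYSTEM_CLOCK_HZ = 50_000_000
SAMPLE_RATE_HZ = 48_000


def cycles_per_sample(clock_hz: int = SYSTEM_CLOCK_HZ, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Whole system clock cycles available between two audio samples."""
    return clock_hz // sample_rate_hz


def multiplexed_cycles(num_filters: int, pipeline_depth: int) -> int:
    """One shared filter with a new band entering every cycle: N cycles plus the pipeline depth."""
    return num_filters + pipeline_depth


if __name__ == "__main__":
    budget = cycles_per_sample()
    needed = multiplexed_cycles(31, pipeline_depth=5)   # depth of 5 is hypothetical
    print(f"{needed} of {budget} cycles used per sample")
```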

Remember we are talking about ~$20 devices with only a few thousand logic elements. (About 6000 for the EP4CE6 and 15,000 for the EP4CE15; notice the naming pattern?) Each logic element (LE) consists of a programmable four-input logic function and one register.
If your samples are, for example, 20 bits wide, you probably consume around 20 registers at every step, so with 6000 LEs, you can probably do fewer than 300 steps of processing. (These are very rough, back-of-the-envelope estimates I just came up with, but they should give you an idea of what size of FPGA you may need to implement your project.)

Yikes. That’s above my brain level.

Sorry, pipelining is not the easiest concept to explain concisely, but I thought I should mention it as it could be important for @Jos’ application.

Feel free to ask questions if there are parts you think I can clarify.

When you say you are using up multipliers and maybe running out of them, it sounds like those are prefabricated parts of the FPGA you are using. Is that correct? Just wondering, would you be able to make more with the rest of the FPGA?

TL;DR: yes.

Unnecessarily long answer:

Yes, the multipliers I am talking about are dedicated circuits in the FPGA. Most FPGAs have some dedicated circuits for a few common operations; the most common are multipliers, (small) memory blocks and carry chains (to make faster adders).

Some FPGAs have dedicated hardware for high speed transceivers and some communication protocols such as PCIe, or Analog to Digital Converters (ADCs) or Dynamic RAM controllers, etc. Those dedicated hardware sections are called “Hard IP” (Intellectual Property) as opposed to the “Soft IP” which is what is implemented using the FPGA’s programmable logic.

You can always implement additional multipliers in soft IP. The FPGA software even provides you with a library of pre-made soft IP (they’re called IP cores) to implement multipliers and several other common functions, from simple shift registers or multiplexers all the way up to 32-bit processors with cache, interrupt support and custom instructions. Specialized IP cores are also available for purchase from third parties.

I actually spent most of my career at Altera writing IP cores for high speed serial communication protocols so that our customers wouldn’t have to re-invent the wheel. They were protocols that were used by enough of Altera’s customers to be worth the design effort, but not by enough of them to be worth dedicating part of the device’s silicon to them. (If the hard IP is not used, it is wasted silicon that just increases the price for every customer.)

Implementing a multiplier in soft logic (the FPGA programmable gates) can be quite expensive in resources (that is why they are implemented in hardware). For example, the Cyclone IV device that I am currently using provides fifteen 18-bit by 18-bit multipliers in hardware. The soft IP that is also provided to implement such a multiplier consumes 437 Logic Elements (out of the 6272 available in the small device I started with), so even if you used up the whole device just for soft multipliers, you could not implement as many multipliers as are provided in hardware (15).
Additionally, multipliers implemented in soft logic would be slower than the hardware multipliers (which may be an issue if you are trying to squeeze all the performance that you can out of your circuit). Soft IP also consumes more power for the same function. These are all good reasons why the FPGA manufacturers provide hard IP blocks.
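
The soft multiplier arithmetic is quick to verify. This is only a back-of-the-envelope Python sketch using the figures quoted above (the function names are made up), and it ignores that a real design also needs LEs for everything else.

```python
LOGIC_ELEMENTS = 6272            # small Cyclone IV device, figure from the post
LES_PER_SOFT_MULTIPLIER = 437    # 18x18 soft multiplier, figure from the post
HARD_MULTIPLIERS = 15


def max_soft_multipliers(logic_elements: int = LOGIC_ELEMENTS,
                         les_each: int = LES_PER_SOFT_MULTIPLIER) -> int:
    """How many soft multipliers fit if the whole device is used for nothing else."""
    return logic_elements // les_each


def les_left(soft_multipliers: int,
             logic_elements: int = LOGIC_ELEMENTS,
             les_each: int = LES_PER_SOFT_MULTIPLIER) -> int:
    """Logic elements remaining after building some soft multipliers."""
    return logic_elements - soft_multipliers * les_each


if __name__ == "__main__":
    print(max_soft_multipliers(), "soft multipliers at most, versus", HARD_MULTIPLIERS, "hard ones")
    print(les_left(2), "LEs left after adding two soft multipliers")
```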

But in a pinch, if you are short a multiplier or two and have some LEs left, you definitely can go ahead and implement those multipliers in soft logic.

This stuff is cool! Thank you a lot for explaining it in a way that we mortals can digest! :slight_smile:
A few years ago I looked into FPGAs because there were some ideas to use them for numerical fluid modeling, but I couldn’t understand what was going on and then I never read about that again… no idea if that is still a thing. Would be cool though to have a little brick that could compete with room filling super computers!

In Vocode-O-Matic I implemented the filters as shift registers like this:

// Direct Form I topology is used for filtering.
for (int i = 0; i < NR_OF_BANDS; i++)
{
    ym[i][0] = mod_alpha1[i] * (xm[0] - xm[2] - mod_alpha2[i] * ym[i][1] - mod_beta[i] * ym[i][2]);

    // Shift modulator filter taps.
    ym[i][2] = ym[i][1];
    ym[i][1] = ym[i][0];
}

Where xm[0] is the modulator signal and ym[i][0] is the output of one of NR_OF_BANDS bandpass filters. mod_alpha1, mod_alpha2 and mod_beta are parameters that determine the filter’s center frequency and bandwidth. I iterate over all bands, hence the index i, and then in the code following this snippet do something with the resulting signal. This is done each time a new sample arrives from the ADC (in actual fact I get a buffer with a block of samples, but the processing in principle does not change). The above code contains 3 multiplications.

There is a similar loop that processes the carrier signal and then the outputs of the respective bands are multiplied by the envelope output of each modulator band (via a connection/modulation matrix).

In an FPGA I initially thought I would not want to use 2 of these loops, but process all filter bands in parallel. That would cost 2 * 31 * 3 = 186 multipliers. This means that even with the 144 multipliers in my Intel MAX 10 10M50 this cannot be done. So how best to approach this problem?

Would it be feasible to multiplex the carrier and modulator filters? I mean, since their architecture is very similar, use one set of NR_OF_BANDS filters and swap between filter tap values, buffering intermediate output values and keeping track of the mod_alpha and mod_beta variables, thus reducing the number of multipliers required to 1 * 31 * 3 = 93? How would you approach this problem?
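
To compare the options by multiplier count alone, here is a small Python tally. It assumes each multiplication in the filter maps to exactly one hardware multiplier, which only holds if the operands fit the multiplier width. The third architecture (one filter reused for every band) is added for comparison and is not something I have built.

```python
NR_OF_BANDS = 31
MULTIPLIES_PER_FILTER = 3
AVAILABLE_MULTIPLIERS = 144      # figure quoted above for the 10M50


def multipliers_needed(filter_instances: int) -> int:
    """Hardware multipliers required if every filter instance gets its own three."""
    return filter_instances * MULTIPLIES_PER_FILTER


architectures = {
    "all carrier and modulator filters in parallel": multipliers_needed(2 * NR_OF_BANDS),
    "one bank shared by carrier and modulator": multipliers_needed(NR_OF_BANDS),
    "one filter reused for every band": multipliers_needed(1),
}

if __name__ == "__main__":
    for name, count in architectures.items():
        verdict = "fits" if count <= AVAILABLE_MULTIPLIERS else "does not fit"
        print(f"{name}: {count} multipliers, {verdict}")
```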

Mini update on my synthesizer in an FPGA project.

  • I wasted a couple of days trying to add resonance to my low pass filter with no success. I might have to completely change the design of the filter and go for a more academic approach, which is too bad because I was kind of proud of my original design.

  • I added support for this PCM5102 based DAC board.
    https://www.aliexpress.com/item/32824604720.html

The output is DC coupled and looks a lot more like what the circuit actually synthesizes.
The following oscilloscope screen captures show the new DAC’s output in magenta at the top and the S/PDIF DAC output in cyan at the bottom.

Triangle waves are more triangular:

Square waves are flatter:

Ramp up sawtooth waveforms are straight and actually ramp up.
(For some reason the S/PDIF output is inverted. Is it due to the AC coupling?)

The random waveform looks the most different.

There are two downsides to the PCM5102 DAC. First, there is a bit more high frequency noise, but I should be able to filter that out with a simple RC filter. Second, the output is delayed about 88µs compared to the S/PDIF output:

88µs is a negligible delay, but if it ever becomes a problem, the PCM5102 has a low latency mode that should reduce the delay to being shorter than the S/PDIF DAC delay.

88µs is way below anything you should worry about. Let’s say it’s roughly 1E-4 s, and the speed of sound is somewhat less than 5E2 metres per second. So in that time sound can travel less than 5E-2 metres: 5cm or 2 inches. The human ear and aural apparatus is amazing but it can forgive such discrepancies.
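
For a slightly less rounded version of the same estimate, here is the arithmetic in Python. The speed of sound (343 m/s, dry air at about 20 °C) and the 48 kHz sample rate are assumptions on my part, and the function names are made up.

```python
SPEED_OF_SOUND_M_PER_S = 343.0   # assumed: dry air at about 20 degrees C
SAMPLE_RATE_HZ = 48_000          # assumed: the sample rate used earlier in the thread


def delay_as_distance_m(delay_s: float) -> float:
    """Distance sound travels in air during the given delay."""
    return delay_s * SPEED_OF_SOUND_M_PER_S


def delay_in_samples(delay_s: float) -> float:
    """The same delay expressed in sample periods."""
    return delay_s * SAMPLE_RATE_HZ


if __name__ == "__main__":
    delay = 88e-6
    print(round(delay_as_distance_m(delay) * 100, 1), "cm")
    print(round(delay_in_samples(delay), 1), "samples")
```

So the delay is equivalent to moving the speaker by about 3 cm, or a little over four sample periods.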

Disclaimer: I am not an expert in Digital Signal Processing in FPGAs. I have never done it professionally.

The first thing to do is determine the data types and precision required.
I’m assuming in software you’re using float variables, maybe even doubles.
The hardware multipliers available in the Cyclone IV and MAX10 FPGAs are 18-bit x 18-bit integer multipliers. Only recent high end FPGAs have floating point hardware, and even that is only single precision float IIRC.
So you have to ask yourself: can I use fixed point variables instead of floating point?
It might still be simpler to stick with floating point, at least at first as a proof of concept.
The IP libraries provide modules to do basic floating point operations; you just need to define the exponent and mantissa sizes that you want.
So what size of mantissa do you need?
If 18 bits are not enough, each multiplication will potentially cost you four hardware multipliers (or more if 36 bits are not enough).
You can save on hardware multipliers if one of the operands is 18 bits or less and only the other is more than 18 bits. (Do samples and coefficients need to have the same resolution?)
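
To show where the "four multipliers" figure comes from, here is a Python sketch that builds a wide multiplication out of 18-bit pieces. It is an illustration only: it uses unsigned operands for simplicity, mul18 is a stand-in for one hardware multiplier, and the function names are made up.

```python
LIMB_BITS = 18
LIMB_MASK = (1 << LIMB_BITS) - 1


def mul18(a: int, b: int) -> int:
    """Stand-in for one 18x18 hardware multiplier (unsigned operands only)."""
    assert 0 <= a <= LIMB_MASK and 0 <= b <= LIMB_MASK
    return a * b


def mul36(a: int, b: int) -> int:
    """36x36 unsigned multiply built from four 18x18 partial products."""
    a_hi, a_lo = a >> LIMB_BITS, a & LIMB_MASK
    b_hi, b_lo = b >> LIMB_BITS, b & LIMB_MASK
    return ((mul18(a_hi, b_hi) << (2 * LIMB_BITS))
            + ((mul18(a_hi, b_lo) + mul18(a_lo, b_hi)) << LIMB_BITS)
            + mul18(a_lo, b_lo))


def mul36x18(a: int, b: int) -> int:
    """36x18 unsigned multiply: only two partial products when b fits in 18 bits."""
    a_hi, a_lo = a >> LIMB_BITS, a & LIMB_MASK
    return (mul18(a_hi, b) << LIMB_BITS) + mul18(a_lo, b)
```

Each call to mul18 corresponds to one hardware multiplier if all the partial products are computed in the same clock cycle. The shifts are free in hardware (just wiring), but the additions cost some logic.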

As for the circuit architecture, I don’t think you need to, or should, unroll the loop.
The circuit should be fast enough to process the 31 bands sequentially, in a pipelined fashion, in less than one sample time.

FPGAs have small block memories that are perfect to store the coefficients (mod_alpha and mod_beta) and the input and output samples (xm and ym) that need to be fed sequentially.

The block memories are small but very flexible and very useful.
In the MAX10 these small memories are called M9K because they hold 9 kilobits. Yes, bits.
They can be configured from 8k deep by 1 bit wide all the way to 256 words deep by 36 bits wide.
Another very useful feature of these small block memories is that you can write to them and read from them at the same time, which is often necessary in pipelined circuits.

Assuming 36 bits are enough precision, you would need one M9K for each of your coefficients.
Depending on exactly how they are used, xm and ym could be stored either in M9Ks or in registers.

Getting the pipeline timing right will be a bit tricky (e.g. you have to start reading from the memories two cycles before you need the memory’s output, because of the memory’s internal latency), but you should end up with a relatively compact circuit, that you might be able to fit in a smaller and cheaper FPGA than the 10M50, should you ever want to commercialize the product. Or you’ll end up with more space to add other features.
