Wednesday, June 10, 2015

Getting started with ARM Cortex M4 - running your first project

As with any new language or piece of hardware, we start off with the hello world project. To do that, first open Kinetis Design Studio (KDS). Select File -> Import. In the Import wizard, select "Project of projects" -> Existing Projects Sets and click Next.


Browse the C drive to find the frdmk22f folder in the Freescale installation path. Find the hello_world folder in the demo_apps directory and open the kds folder. My installation path is as follows: "C:\Freescale\KSDK_1.2.0\examples\frdmk22f\demo_apps\hello_world\kds". Select the file named "hello_world.wsd" and import it into the project explorer of KDS.


Now two different projects appear in the project explorer. Right click on "ksdk_platform_lib_K22F51212" and select "Build Project". The project should build successfully. If the build fails, right click and select "Clean Project" and then "Build Project".


Now, right click on the "ksdk_platform_lib_K22F51212" project and select Delete. Do not check the "Delete project contents on disk" option. We now have all our platform libraries built and ready. Next, right click on the "hello_world_frdmk22f" project and select "Clean Project" and then "Build Project". The project should now build successfully.

Next, we'll need to start a terminal window. I prefer PuTTY. Figure out the COM port of your development board to use in PuTTY, and use 115200 as the baud rate. Now, debug the project in KDS using the GDB PEMicro debugger (installing the debugger in KDS was covered in the previous post). Once in the debug perspective, press the Resume button. You should see the output "Hello world!" in the terminal.
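If you open main.c inside the hello_world folder, the demo is essentially the snippet below. This is a from-memory sketch, so treat the helper names as assumptions and check them against the actual demo source (hardware_init, PRINTF, GETCHAR and PUTCHAR follow KSDK debug-console conventions):

#include "fsl_device_registers.h"
#include "fsl_debug_console.h"
#include "board.h"

int main(void)
{
    char ch;
    hardware_init();                 /* board-specific pin mux and clock setup */
    PRINTF("\r\nHello world!\r\n");  /* goes out over the OpenSDA virtual COM port */
    while (1)
    {
        ch = GETCHAR();              /* echo everything typed in the terminal */
        PUTCHAR(ch);
    }
}

The echo loop doubles as a sanity check for the terminal setup: if characters you type come back, the COM port and baud rate are correct.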



Cheers,
Vijay.

Getting started with ARM Cortex M4 - setting up debug environments

In this post, I'll write about setting up debug environments for Kinetis Design Studio (KDS). These tools are required for on-chip debugging of running code.

PART A: Installing the GNU ARM Eclipse debugger
First we'll install the GNU ARM Eclipse tools, which set up OpenOCD. To do this, first open KDS. Select Help -> Install New Software... In the Install window, click the Add button in the top right corner.


The Name text area is optional. Type "http://gnuarmeclipse.sourceforge.net/updates" in the Location text area and click OK.


Expand the GNU ARM C/C++ Cross Development Tools and select the packages as shown below.


Click Next, accept the license agreement, and then click Finish. The installation will take place in the background. After it is done, KDS needs to be restarted for the changes to take effect.

PART B: Installing the PEMicro debugger
The following procedure works on both KDS and any standalone version of Eclipse (e.g., Eclipse Luna). Follow the same procedure as above, except for the Location text area in the Add Repository window: here, you have to type "http://www.pemicro.com/eclipse/updates". The rest of the procedure is the same.

To test the successful installation, we will write a basic hello world program and try to debug it in the next post.

Cheers,
Vijay.

Tuesday, June 9, 2015

Getting started with ARM Cortex M4 - working with Freescale FRDM-K22F

I have a Freescale FRDM-K22F development platform, and we'll set up the workspace for working and debugging. A prerequisite is to have the groundwork established as described in my previous post – "Getting started with ARM Cortex M4 - setting up the work space".

PART A – Running the out-of-box demos
1. Download and install the mbed OpenSDA v2.1 USB drivers found at the link below.

2. Plug in a USB cable (standard micro USB cable) from a USB host to the OpenSDA v2.1 Micro-AB USB connector (J5). The FRDM-K22F will be powered by this USB connection. The FRDM-K22F comes with the mass-storage device (MSD) Flash Programmer OpenSDA v2.1 application preinstalled, so it will appear as a removable storage drive with a volume label of MBED. The MSD Flash Programmer also includes a USB virtual serial port, which requires an .INF file for proper installation in Windows. The necessary .INF file is included in the mbed OpenSDA v2.1 USB drivers (step 1).

3. Determine the symbolic name assigned to the FRDM-K22F virtual serial port. To do this in Windows, open Device Manager and look for the COM port named "mbed Serial Port". My device was assigned COM5.


4. Open the serial terminal emulation program of your choice. I use PuTTY.

5. Select the Connection Type as "Serial". Set the COM port as per step 3 and the Speed (baud rate) to 115200. Hit Open.


6. Press any key on the keyboard, or the SW3 pushbutton on the board, to display the following menu:
Select from the following menu:
 1. Bubble level – single-axis level using the red and green LEDs to indicate when the board is unlevel.
 2. eCompass demo – compass using the combined magnetometer and accelerometer to indicate the heading of the board relative to magnetic north.
 3. USB CDC demo – second serial terminal. (For additional details on the driver installation, refer to the instructions on the next page.)
 4. Metal detector demo – indicates the direction and distance to a foreign metal object using variable LED brightness as the indicator.
 5. Stopwatch demo – uses the real-time counter (RTC) to implement a stopwatch with 1/100th of a second precision.
 6. Air mouse demo – USB mouse using the accelerometer to determine where/how to move the cursor.

7. Enter a number key (1–6) on the keyboard to run each option from the Quick Start Demo. Further information will be displayed on the terminal as the demo runs.

PART B – Programming using the online mbed compiler
All ARM mbed enabled devices can be programmed using the online mbed compiler. This is a very easy tool for novices to use. Visit the following link and create an mbed online account.

Next, use your login to open the mbed online compiler. The online tool automatically detects your development board. In the top-left corner, click the Import button to open the import wizard. Select the first project, "mbed_blinky", and click Import (in the top-right corner). This project will be imported into your Program Workspace. Select the program in the Program Workspace and click the Compile button to compile it and download the binary file. Once the binary file is downloaded, drag and drop it onto the "mbed" removable device that shows up in My Computer. Then press and release the reset button on the development board. The "mbed_blinky" program will start to run.
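For reference, "mbed_blinky" is just the classic LED toggle. From memory, it looks roughly like this (mbed 2 "classic" API; LED1 is an mbed-defined alias for one of the on-board LEDs):

#include "mbed.h"

DigitalOut myled(LED1);     // LED1 maps to an on-board LED on the FRDM-K22F

int main() {
    while (1) {
        myled = 1;          // LED on
        wait(0.2);          // delay in seconds
        myled = 0;          // LED off
        wait(0.2);
    }
}

If the LED blinks after you copy the binary and press reset, the whole compile-and-flash-via-MSD flow is working.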

PART C – Programming using the Kinetis Design Studio
Freescale provides its own IDE for development on the prototype board, called Kinetis Design Studio. The Kinetis Design Studio IDE is a complimentary integrated development environment for Kinetis MCUs that enables robust editing, compiling and debugging of your designs. Based on free, open-source software including Eclipse, the GNU Compiler Collection (GCC), the GNU Debugger (GDB), and others, the Kinetis Design Studio IDE offers designers a simple development tool with no code-size limitations. Furthermore, Processor Expert software enables your design with its knowledge base and helps create powerful applications with a few mouse clicks. Additional and new device support is installed through the Kinetis Software Development Kit (SDK). KDS can be downloaded from the following link; you may have to register to be able to download.


After KDS and its dependencies are installed, you can start programming directly in the editor and debugging the hardware with the OpenOCD tool.

To develop a sample project, select File -> New -> Kinetis Project. Give the project a name and click Next. Select Boards -> Kinetis -> FRDM-K22F and hit Finish.


Now, everything is good to go and you can start building code.

Getting started with ARM Cortex M4 - setting up the work space

The next series of posts is targeted at beginners to the ARM family. We first need to install the required tools to get started.

The first thing required to make everything work is Java. Make sure you get the latest tools and turn automatic updates on for all software you use. Also, download the 64-bit version of Java if your processor is 64-bit, or the 32-bit version otherwise. Java can be downloaded from the following link.

After Java is installed, download the latest version of the MinGW tools for the gcc compiler, and make sure the MSYS folder ends up inside the MinGW folder. Try to install all the software in their default paths. MinGW can be installed using the following setup.

Now we will set up the GNU tools for ARM embedded processors. Download the latest version of the executable file (.exe). I've downloaded the "gcc-arm-none-eabi-4_9-2015q1-20150306-win32.exe" version, which is the current version at the time of this write-up. The tools can be downloaded from the below link.

We will also need to install the OpenOCD (Open On Chip Debugger) required for debugging. This can be downloaded from the following link.

And we need the eclipse plugin for the OpenOCD too, which can be downloaded from the following link.

Next, we need to add the Java, GNU tools for ARM, and compiler locations to the system Path so the system knows where to find them. Right click on "My Computer" and select Properties. Click on "Advanced system settings" in the left pane to open the "System Properties" window. Click on "Environment Variables" under the "Advanced" tab. Browse for "Path" under "System variables" in the "Environment Variables" window, select it and click the "Edit" button. Add the following text (without quotes) to the beginning of the Path.
"C:\ProgramData\Oracle\Java\javapath;C:\MinGW\bin\;C:\MinGW\msys\1.0\bin;C:\Program Files (x86)\GNU Tools ARM Embedded\4.9 2015q1\bin\;C:\openocd-0.8.0\bin;"

Finally, we need to install an IDE. I recommend Eclipse as it is relatively simple and widely used. The Eclipse Luna IDE for C/C++ Developers (CDT) for Windows can be downloaded from the link below.

Now everything should work. Go ahead and make a sample C++ project to test the installation. Additionally, you could fork somebody's Git repository (one that has a makefile) and test whether that builds too.
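If you'd rather test outside Eclipse first, a trivial program is enough to confirm the compiler is on the Path. Save the following as test.cpp (the file name and command are just my suggestion) and build it from a command prompt with "g++ test.cpp -o test" using the MinGW g++:

#include <iostream>

int main() {
    std::cout << "Toolchain OK" << std::endl;   // prints only if compile, link and run all work
    return 0;
}

From the same prompt you can also run "arm-none-eabi-gcc --version" to confirm the ARM cross tools are visible on the Path as well.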

Cheers,
Vijay.


Wednesday, December 10, 2014

All about caches

A cache is a component that transparently stores data so that future requests for that data can be served faster. The data that is stored within a cache might be values that have been computed earlier or duplicates of original values that are stored elsewhere. If requested data is contained in the cache (cache hit), the request can be served by simply reading the cache, which is comparatively fast. Otherwise (cache miss), the data has to be recomputed or fetched from its original storage location, which is comparatively slow. Hence, the greater the number of requests that can be served from the cache, the faster the overall system performance becomes. To be cost efficient and to enable an efficient use of data, caches are relatively small. Nevertheless, caches have proven themselves in many areas of computing because access patterns in typical computer applications have locality of reference. References exhibit temporal locality if data is requested again that has been recently requested already. References exhibit spatial locality if data is requested that is physically stored close to data that has been requested already.

Principles of locality:
A phenomenon describing the same value, or related storage locations, being frequently accessed. There are two basic types of reference locality – temporal and spatial. Temporal locality refers to the reuse of specific data and/or resources within a relatively small time duration. Spatial locality refers to the use of data elements within relatively close storage locations.
Temporal locality
If at one point in time a particular memory location is referenced, then it is likely that the same location will be referenced again in the near future.
Spatial locality
If a particular memory location is referenced at a particular time, then it is likely that nearby memory locations will be referenced in the near future. 
Reasons for locality:
• Predictability
• Structure of programs – we are used to writing code in such a fashion
• Linearity of data structures – e.g., arrays
Division of responsibility in the hierarchy:
•  Programmers are responsible for moving data between disk and memory through file I/O.
•  Hardware is responsible for moving data between memory and caches.
•  Optimizing compilers are responsible for generating code that, when executed, will cause the hardware to use caches and registers efficiently.

Memory Hierarchy:
A "memory hierarchy" in computer storage distinguishes each level in the "hierarchy" by response time. Since response time, complexity, and capacity are related, the levels may also be distinguished by the controlling technology.



Operation:
Hardware implements a cache as a block of memory for temporary storage of data likely to be used again. A cache is made up of a pool of entries. Each entry has a datum (a piece of data) – a copy of the same datum in some backing store. Each entry also has a tag, which specifies the identity of the datum in the backing store of which the entry is a copy.
 
When the cache client (a CPU, web browser, or operating system) needs to access a datum presumed to exist in the backing store, it first checks the cache. If an entry can be found with a tag matching that of the desired datum, the datum in the entry is used instead. This situation is known as a cache hit. The alternative situation, when the cache is consulted and found not to contain a datum with the desired tag, is known as a cache miss. The previously uncached datum fetched from the backing store during miss handling is usually copied into the cache, ready for the next access. The percentage of accesses that result in cache hits is known as the hit rate or hit ratio of the cache.
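In code, the lookup just described is a tag compare plus a fill on a miss. Below is a toy direct-mapped sketch of it; the geometry and all the names are mine, purely for illustration:

#include <cstdint>
#include <cstdio>
#include <cstring>

// Toy direct-mapped cache: the line selected by the index is a hit iff it
// is valid and its stored tag matches the tag of the requested address.
const int OFFSET_BITS = 4;                    // 16-byte blocks
const int INDEX_BITS  = 8;                    // 256 lines

struct Line { bool valid; uint32_t tag; uint8_t data[16]; };
Line cache[1 << INDEX_BITS];                  // globals start zeroed: all lines invalid
uint8_t backing_store[1 << 20];               // pretend main memory (1 MB)

uint8_t read_byte(uint32_t addr) {
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    Line &line = cache[index];
    if (!(line.valid && line.tag == tag)) {   // cache miss: fill the whole block
        std::memcpy(line.data, &backing_store[addr & ~0xFu], sizeof line.data);
        line.valid = true;
        line.tag   = tag;
    }
    return line.data[offset];                 // serve the request from the cache
}

int main() {
    backing_store[0x1234C] = 42;
    printf("%d\n", read_byte(0x1234C));       // miss, block filled, prints 42
    printf("%d\n", read_byte(0x1234C));       // second access is a hit
    return 0;
}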

Cache metrics:
Hit rate = #references that hit / #references
Miss rate = #references that missed / #references = 1 – hit rate
Average access time = (hit rate * hit time) + (miss rate * miss time)
Average miss penalty = average access time of the next level down
e.g., the average miss penalty of L1 = the average access time of L2
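As a worked example of these formulas, here is a tiny calculation for a two-level hierarchy; all the cycle counts and hit rates are made-up numbers for illustration:

#include <cstdio>

// Average access time = hit_rate * hit_time + miss_rate * miss_time,
// where the miss time of one level is the average access time of the next.
double amat(double hit_rate, double hit_time, double miss_time) {
    return hit_rate * hit_time + (1.0 - hit_rate) * miss_time;
}

int main() {
    // Hypothetical figures: L2 hits in 10 cycles with a 90% hit rate over a
    // 100-cycle memory; L1 hits in 1 cycle with a 95% hit rate.
    double l2 = amat(0.90, 10.0, 100.0);  // = 19 cycles
    double l1 = amat(0.95, 1.0, l2);      // L1 miss time = L2 access time = 19
    printf("L2 average access time = %.2f cycles\n", l2);
    printf("L1 average access time = %.2f cycles\n", l1);  // = 1.90 cycles
    return 0;
}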
Structure of cache:
The structure of a cache line is shown below:

flag bits | tag | data block

An instruction cache requires only one flag bit per cache row entry: a valid bit. The valid bit indicates whether or not a cache block has been loaded with valid data. A data cache typically requires two flag bits per cache line – a valid bit and a dirty bit. Having a dirty bit set indicates that the associated cache line has been changed since it was read from main memory ("dirty"), meaning that the processor has written data to that line and the new value has not propagated all the way to main memory.
The data block is the actual block fetched from the backing storage. The tag is part of the full address of the data in the backing storage. The size of a cache is the amount of main-memory data it can hold, which is:
(number of bytes in each data block) * (number of data blocks per cache level)
The memory address is split into tag, index and byte, as shown below:

tag | index | byte

As an example, consider a cache with 16-byte blocks and 256 indexes. The last hex digit of the address specifies the byte within a block (0 to F) – 16 bytes. The second and third hex digits from the right specify the index, from 00 to FF – 256 index values. The remaining most-significant digits form the tag, which is matched against the stored tag on read or write operations. A short sketch of this decomposition follows.
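Here is the same split expressed in code; the constants match the example geometry above (4 offset bits, 8 index bits), and the sample address is arbitrary:

#include <cstdint>
#include <cstdio>

const uint32_t OFFSET_BITS = 4;   // 16-byte blocks -> last hex digit
const uint32_t INDEX_BITS  = 8;   // 256 indexes -> next two hex digits

int main() {
    uint32_t addr   = 0xDEADBEEF;
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    // For 0xDEADBEEF: tag = 0xDEADB, index = 0xEE, offset = 0xF --
    // you can read the fields straight off the hex digits.
    printf("tag=0x%05X index=0x%02X offset=0x%X\n", tag, index, offset);
    return 0;
}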
Design dimensions:
1. Block size: this is the read/write granularity to the next level down.
Increasing the block size:
a. Exploits spatial locality by prefetching likely-to-be-used data
b. Can also hurt: cache fills take longer, and larger blocks reduce the amount of storage available for other blocks

2. Transfer size: how much data is brought in per block; the amount brought in might be less than the full block size

Associativity:
Associativity determines where a block fetched from backing storage can be placed.
1. Fully associative: a block can be placed anywhere.
Disadvantages:
a. Every single block has to be checked for a match (using the tag)
b. This consumes power and takes time
Advantage: better hit rate
2. Direct mapped: a block can go to only one place, determined by its block ID.
Disadvantage:
a. Conflicts: with 8 locations, block addresses 0, 8, 16, 24, 32, … all land in the same location (ID 0), so the miss rate is higher
Advantage: faster access time (than fully associative) and a smaller tag (hence less space and less power per access)
3. Set associative: instead of forcing data into only one location, the hardware may put the data in any of n available locations within the set selected by the index (ID); this is called n-way set associative.

4. Multiple direct mapped caches: we can have multiple direct mapped caches and choose data from a particular one based on the tag (address). Hardware decides which direct mapped cache the current block should be put in.
Advantage: more flexibility and a better hit rate than a single direct mapped cache
Disadvantage: longer access time and more power dissipation than direct mapped
This scheme sits in between direct mapped and fully associative.
Size of cache = block size * associativity * number of sets
For example, a 4-way cache with 64-byte blocks and 128 sets holds 64 * 4 * 128 bytes = 32 KB.
Bigger is better for hit rates; bigger is worse for power consumption and access times.

Replacement policies: (when bringing a block in, what do we throw out?)
During a cache miss, the cache usually evicts some other entry in order to make room for the newly fetched datum. The heuristic used to select the entry to evict is known as the replacement policy.
Belady's MIN is the classic reference point: when deciding which data to cache, look into the future, and if a piece of data is not going to be used for a long time, do not cache it – transfer it directly to the processor instead. For several years MIN was assumed to be the gold standard, but it turns out that MIN is not optimal for hardware processor caches: it minimizes the number of misses, which is not the same as minimizing stall time when misses have different costs. In any case, no real hardware can look into the future, so practical policies predict the future from the past. The most common is least recently used (LRU), which replaces the least recently used entry; a minimal sketch follows.
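A common way to model LRU in software is a list ordered by recency plus a map for constant-time lookup. This is a toy model of the policy, not anything from the notes; the container choices are mine:

#include <cstddef>
#include <cstdio>
#include <list>
#include <unordered_map>

// Minimal LRU cache of fixed capacity: on every access the key moves to the
// front of the recency list; when full, the back of the list is evicted.
class LRUCache {
    std::size_t capacity;
    std::list<int> recency;                                   // front = most recent
    std::unordered_map<int, std::list<int>::iterator> where;  // key -> list position
public:
    explicit LRUCache(std::size_t cap) : capacity(cap) {}
    bool access(int key) {                                    // returns true on a hit
        auto it = where.find(key);
        if (it != where.end()) {
            recency.splice(recency.begin(), recency, it->second); // mark most recent
            return true;
        }
        if (recency.size() == capacity) {                     // miss: evict the LRU entry
            where.erase(recency.back());
            recency.pop_back();
        }
        recency.push_front(key);
        where[key] = recency.begin();
        return false;
    }
};

int main() {
    LRUCache c(2);
    int refs[] = {1, 2, 1, 3, 2};   // 3 evicts 2 (the LRU block), so the final access misses
    for (int r : refs)
        printf("ref %d: %s\n", r, c.access(r) ? "hit" : "miss");
    return 0;
}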

Inclusion:
I've got the cache and I've got the backing store. I make a copy of a block from the backing store into the cache, and a copy of it still remains in the backing store. This is called inclusion.
Inclusive caches:
If a block is copied from the L2 cache into the L1 cache, its copy also remains in L2.
Exclusive caches:
The block is moved from L2 to L1, deleting the original copy in L2.
Non-inclusive caches:
Data might be in both caches, but neither inclusion nor exclusion is enforced.

Cache problems:
1. Content management heuristics: what to cache?
2. Logical organization: where to put it?
3. Consistency management heuristics: how to maintain it? – keep a cache consistent with itself, with other caches, and with backing storage.
4. Due to virtual memory, two different virtual addresses can refer to the same memory location. All updates to that memory location should happen in program order, so the processor must ensure that only one copy of any physical memory location is present in the cache at any given time.
Solution to point 4 – split caches:
Use separate instruction and data caches. This gives twice the bandwidth, and two single-port caches are easier to implement than one two-port cache. The instruction cache can be made read only; it is used during the fetch stage, while the data cache is used during the memory stage.

Write policies:
1. Write through: every write in the cache goes "through" to memory, so writes become very slow.
Solution: use write buffers (logically part of memory, physically part of the cache). Data to be written to memory is dumped into the write buffer, and the buffer drains it back to main memory over the following cycles.
Two issues with a write buffer: What if the buffer is full? – Stall and wait until it is not.
What if a read (or write) misses while the data is still buffered? –
If it is a plain write buffer, drain it first.
If it is a write cache, it has tags, so it can be searched and the data read directly from it.
A write buffer converts a latency problem into a bandwidth problem: we can't make memory itself faster, as that is costly, but write buffers improve average latency (a toy sketch of the buffer appears at the end of this section).
2. Write back: when data in the cache is modified, it is not written back to main memory immediately. This is called "lazy write": a slow main memory is no longer a problem, and unnecessary writes to main memory are avoided.

Problem: the data in the cache and in main memory become inconsistent. How is that solved?
With a dirty bit – a Boolean flag that is set to 1 when the data in the cache has been modified.
When does the data get written to main memory? When a read miss occurs and some block has to be evicted to make room for the incoming block. But now every read miss potentially takes longer than before: first the evicted block has to be written to main memory, and only then can the new block be brought into the cache.
Even on a write miss, the processor has to check whether the block being replaced is dirty. If it is, that block has to be written to main memory (or to the cache level below) before the fresh copy is placed; the fresh copy's dirty bit is then marked 1.
In general: write back gives better performance; write through is simpler in terms of the programming model.
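Going back to the write buffer from policy 1: the mechanism can be sketched as a small FIFO between the CPU and memory. Stores complete as soon as they are queued, and the CPU stalls only when the queue is full; the capacity and names below are illustrative:

#include <cstdint>
#include <cstdio>
#include <queue>
#include <utility>

// Toy write buffer: stores are queued instead of waiting for slow memory.
// A background drain step retires one entry per "memory cycle".
class WriteBuffer {
    std::queue<std::pair<uint32_t, uint32_t>> pending;  // (address, value)
    const std::size_t capacity = 4;
public:
    bool store(uint32_t addr, uint32_t val) {   // returns false -> CPU must stall
        if (pending.size() == capacity) return false;
        pending.push({addr, val});              // the store "completes" immediately
        return true;
    }
    void drain_one() {                          // called once per memory cycle
        if (pending.empty()) return;
        std::pair<uint32_t, uint32_t> w = pending.front();
        pending.pop();                          // write through to main memory
        printf("mem[0x%08X] <= 0x%08X\n", w.first, w.second);
    }
};

int main() {
    WriteBuffer wb;
    for (uint32_t i = 0; i < 6; ++i)
        if (!wb.store(0x1000 + 4 * i, i))
            printf("store %u stalls: buffer full\n", i);  // stores 4 and 5 stall
    for (int c = 0; c < 4; ++c) wb.drain_one();           // memory catches up later
    return 0;
}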

Cache miss types:
A cache miss is a failed attempt to read or write a piece of data in the cache, which results in a main memory access with much longer latency (the delay between the miss and the actual availability of the data). By operation, there are three kinds of cache misses: instruction read misses, data read misses, and data write misses.
Instruction read misses cost the most time, because the processor has to wait for the instruction before it can proceed.
Data read misses cost less, since until the data arrives the processor can keep executing instructions that do not depend on it.
Data write misses cost the least, because the writes can be queued into the write buffer and the processor waits only when the buffer is full.

By cause, misses can be classified into four types (the "four Cs"):
Compulsory miss: (also called a cold-start miss) occurs on initial startup, when no values are in the cache yet. Can be reduced by prefetching data and increasing the block size.
Capacity miss: occurs because the cache is too small. Increase the cache size to address this.
Conflict miss: occurs even in a large cache when it lacks associativity. Increase associativity to address this.
Coherence miss: caused by invalidations from the coherence engine in multiprocessor systems.

Addressing schemes:
There are 4 schemes:
1. Virtually indexed, virtually tagged (VIVT): both the index and the tag are obtained from the virtual address.
Advantage: very fast lookup time
Disadvantage: susceptible to virtual address aliasing
2. Physically indexed, physically tagged (PIPT): both the index and the tag are obtained from the physical address. Used in L2 caches and beyond (L1 caches have to be so fast that usually only VIPT is suitable for them).
Advantage: no aliasing problems
Disadvantage: longer latency

3. Virtually indexed, physically tagged (VIPT): the index comes from the virtual address. By the time the cache has been searched and the matching cache line found, the address translation is also done; the tag stored in the cache line is then matched against the physical tag to decide whether it is a hit or a miss.
Advantage over PIPT: lower latency, since the virtual index is used to search the cache while the address translation is still in progress. However, we have to wait for the physical tag from the translation to decide whether it is a hit (a sizing-rule sketch follows this list).
Advantage over VIVT: no aliasing issues, and coherence is straightforward.
4. Physically indexed, virtually tagged (PIVT): this is useless; once the physical address is in hand for indexing, there is no reason to tag with the virtual address.
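The notes above don't state it, but a standard VIPT design rule is worth recording: if every index bit falls inside the page offset, the virtual and physical indexes are identical and the cache cannot alias. That holds when cache size / associativity <= page size. A quick check:

#include <cstddef>
#include <cstdio>

// Standard rule of thumb: a VIPT cache is alias-free when all of its index
// bits come from the page offset, i.e. size / associativity <= page size.
bool vipt_alias_free(std::size_t cache_size, std::size_t ways, std::size_t page_size) {
    return cache_size / ways <= page_size;
}

int main() {
    // e.g. a 32 KB, 8-way L1 with 4 KB pages: 32768 / 8 = 4096 <= 4096 -> alias-free
    printf("%s\n", vipt_alias_free(32 * 1024, 8, 4096) ? "alias-free" : "may alias");
    return 0;
}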

Cache coherence and consistency:
Memory coherence: memory behaves "rationally".
Memory consistency: the programming contract of the hardware; it lets the programmer define behavior.
Cache coherence: the hardware implementation that guarantees memory coherence.
Cache consistency: the principle that cache systems behave "rationally".
The contents of memory simply don't change until you overwrite or erase them. When clients in a system maintain caches of a common memory resource, problems may arise with inconsistent data. This is particularly true of CPUs in a multiprocessing system. Referring to the illustration below, if the top client has a copy of a memory block from a previous read and the bottom client changes that memory block, the top client could be left with an invalid cache of memory without any notification of the change. Cache coherence is intended to manage such conflicts and maintain consistency between cache and memory.


The timing or ordering of writes determines when the rest of the system sees the results.

Consistency models:
The data consistency model specifies a contract between programmer and system, wherein the system guarantees that if the programmer follows the rules, memory will be consistent and the results of memory operations will be predictable.

1. Strict consistency: the value read by a processor is always the most recently written value. This makes perfect sense for a uniprocessor, but creates a problem for multiprocessors.
Consider two processors A and B: if A writes a value and B reads it, the soonest the value can be available at B is bounded by the signal propagation latency (the light cone originating at A touches B at a point, which is the earliest the data can be visible to B).
Thus, strict consistency is never used for multiprocessors.

2. Sequential consistency: this is what everybody does.
a. Memory operations are performed in program order: if the program has a load before a store, the load actually happens before the store.
b. All CPUs see the same global ordering.
c. The global ordering (interleaving) can be anything.
Implication: if machine A sees a particular order, then everyone in the system sees the same order.
d. All I/O must go through the memory system (the coherence system).
e. You can't buffer writes.
f. Given write A, read B, read C, read D, you can't buffer write A and start performing the reads: the world would then see B, C, D, A, while the actual global order is supposed to be A, B, C, D.
g. Reordering operations breaks the model.

3. Processor consistency: also called total store order (TSO).
a. Only writes are ordered.
b. Reads can occur in any order.

Cache coherence:

SI: Shared/Invalid:
This only works with write-through caches, because with write through the backing store always holds data consistent with all the caches, and blocks can be evicted safely.
A data block can be in one of two states: shared (cannot write) or invalid (not present).
On a read: get a shared copy from memory or from any other cache.
On a write, there are two options:
1. Write update: every time you write to memory, you also update all the other caches with the same value.
2. Write invalidate: every time you write to memory, you ask all other caches holding this block to evict their copies and get a fresh copy from the backing store.

MSI: Modified/Shared/Invalid: supports write-back caches as well.
Each block within any cache can be in one of three states:
Modified: the block in this cache has been written and is inconsistent with the backing storage and other caches.
Shared: the block in this cache is a read-only copy and can be evicted safely.
Invalid: the block is not present in this cache; it lives in the backing store or in some other cache.
These coherency states are maintained through communication between the caches and the backing store. The caches have different responsibilities when blocks are read or written, or when they learn of other caches issuing reads or writes for a block.
When a read request arrives at a cache for a block in the "M" or "S" states, the cache supplies the data. If the block is not in the cache (in the "I" state), it must verify that the line is not in the "M" state in any other cache. Different caching architectures handle this differently. For example, bus architectures often perform snooping, where the read request is broadcast to all of the caches. Other architectures include cache directories which have agents (directories) that know which caches last had copies of a particular cache block. If another cache has the block in the "M" state, it must write back the data to the backing store and go to the "S" or "I" states. Once any "M" line is written back, the cache obtains the block from either the backing store, or another cache with the data in the "S" state. The cache can then supply the data to the requestor. After supplying the data, the cache block is in the "S" state.
When a write request arrives at a cache for a block in the "M" state, the cache modifies the data locally. If the block is in the "S" state, the cache must notify any other caches that might contain the block in the "S" state that they must evict the block. This notification may be via bus snooping or a directory, as described above. Then the data may be locally modified. If the block is in the "I" state, the cache must notify any other caches that might contain the block in the "S" or "M" states that they must evict the block. If the block is in another cache in the "M" state, that cache must either write the data to the backing store or supply it to the requesting cache. If at this point the cache does not yet have the block locally, the block is read from the backing store before being modified in the cache. After the data is modified, the cache block is in the "M" state.
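These transitions read naturally as a small per-line state machine. Here is a deliberately simplified sketch (a snooping bus is assumed; data movement, write-backs and bus arbitration are all omitted):

#include <cstdio>

// Simplified MSI state machine for a single cache line. Real protocols also
// move data and write back dirty lines; only the state changes are shown.
enum class State { Modified, Shared, Invalid };

State on_cpu_read(State s)  { return s == State::Invalid ? State::Shared : s; }
State on_cpu_write(State)   { return State::Modified; }    // after invalidating others

// What this cache does when it snoops another cache's bus request for the line:
State on_bus_read(State s)  { return s == State::Modified ? State::Shared : s; }
State on_bus_write(State)   { return State::Invalid; }     // another cache takes ownership

int main() {
    State s = State::Invalid;
    s = on_cpu_read(s);     // I -> S: fetch a shared copy
    s = on_cpu_write(s);    // S -> M: invalidate other copies, then modify
    s = on_bus_read(s);     // M -> S: write back so the reader sees fresh data
    s = on_bus_write(s);    // S -> I: another writer invalidates our copy
    printf("final state index: %d\n", static_cast<int>(s));  // 2 == Invalid
    return 0;
}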

MESI: Modified Exclusive Shared Invalid:
Every cache line is marked with one of the four following states (coded in two additional bits):
Modified
The cache line is present only in the current cache, and is dirty; it has been modified from the value in main memory. The cache is required to write the data back to main memory at some time in the future, before permitting any other read of the (no longer valid) main memory state. The write-back changes the line to the Exclusive state.
Exclusive
The cache line is present only in the current cache (in the whole system it is present in no other cache), but it is clean; it matches main memory. It may be changed to the Shared state at any time in response to a read request. Alternatively, it may be changed to the Modified state when writing to it.
Shared
Indicates that this cache line may be stored in other caches of the machine and is "clean"; it matches the main memory. The line may be discarded (changed to the Invalid state) at any time.
Invalid
Indicates that this cache line is invalid (unused).


DISCLAIMER: The above article is based on my notes from one of my graduate courses, and a major portion of it is from Wikipedia.

Cheers,
Vijay.